Artificial neural networks are a computational framework that has become a focus
of  widespread  interest.  One  of  the  most  widely  used  neural  networks  is  the  feedfor- 
ward neural  network  (FNN).  This  type  of  neural  network  can  be  used  to  learn  the 
underlying  rules  from  examples.  This  learning  ability  enables  FNNs  to  have  wide  ap- 
plicability. However,  the  theory  behind  this  neural  network  model  is  still  immature. 
There are many deficiencies in the current neural network learning algorithms that
have  hindered  their  usefulness. 

In  this  dissertation,  we  surveyed  the  research  in  FNN  learning.  Several  new  algo- 
rithms are  proposed  to  improve  the  learning  efficiency  of  FNNs.  We  have  developed 
a  globally  guided  neural  network  training  algorithm  that  converges  to  a  global  opti- 
mal solution  and  reduces  the  training  time.  Both  stochastic  and  deterministic  global 
optimization  approaches  are  employed  for  neural  network  training.  The  stochastic 
methods  include  genetic  algorithms,  simulated  annealing,  and  pure  random  searches. 
Deterministic  methods  considered  for  neural  net  training  are  branch-and-bound  based 
Lipschitz  optimizations.  By  exploring  the  special  structure  of  the  FNN  and  the  prop- 
erty of  the  sigmoid  activation  function,  we  developed  procedures  for  computing  Lip- 
schitz constants  over  subsets  of  the  weight  space.  With  local  Lipschitz  constants  we 
can identify weight regions that do not contain promising solutions and develop pruning
methods that reduce the search space. The main advantage of the global optimal
training algorithms (GOTA) is that they yield a guaranteed global optimal solution.
GOTA  can  also  be  combined  with  local  search  procedures,  such  as  backpropagation, 
to  produce  more  efficient,  but  still  globally  convergent  algorithms. 


CHAPTER  1 
INTRODUCTION 

Artificial  neural  networks  (neural  networks  or  neural  nets  for  short)  are  a  com- 
putational framework  that  has  recently  become  a  focus  of  widespread  interest.  In 
contrast  to  conventional  centralized,  sequential  processing,  neural  networks  consist 
of  massively  connected  simple  processing  units,  which  are  analogous  to  the  neurons 
in  the  biological  brain.  Through  elementary  local  interactions  (such  as  excitatory 
and  inhibitory)  among  these  simple  processing  units,  sophisticated  global  behaviors, 
which  resemble  the  high-level  recognition  process  of  humans,  emerge. 

Information  in  a  neural  network  is  distributed  across  many  processing  units  and 
the  connections  among  them,  rather  than  stored  in  a  single  location.  The  process- 
ing units  act  in  parallel  and  communicate  only  with  their  local  peers.  This  makes 
high-speed  computation  readily  achievable  through  parallel  computers.  The  parallel 
and distributed processing (PDP) computational paradigm exhibits many desirable
features,  such  as  fault  tolerance  (resistance  to  hardware  failure),  robustness  in  han- 
dling different  types  of  data,  graceful  degradation  (being  able  to  process  noisy  or 
incomplete  information)  (Matheus  and  Hohensee,  1987),  and  the  ability  to  learn  and 
adapt  (Rumelhart  et  al.,  1986;  Lippmann,  1987;  Hinton,  1989). 

Research  in  neural  nets  experienced  a  sudden  resurgence  in  the  early  1980s  and 
has  seen  an  explosive  growth  in  the  last  few  years.  The  excitement  about  neural 
nets  is  rooted  in  understanding  information  processing  in  human  brains.  But  recent 
interest  in  neural  network  study  has  grown  to  cover  a  wide  spectrum  of  areas  from 
industry,  to  education,  to  business,  to  the  military  (Simpson,  1990). 

Major  interest  in  neural  networks  has  shifted  from  their  biological  association 
toward their utility as powerful and flexible computational frameworks. Recent theoretical
study in neural nets has brought about encouraging results (White, 1989;
Hornik et al., 1990), although with regard to the whole area a sound theoretic foundation
has yet to be established. The fast growth of this area has been pushed by
extensive  applications  of  the  neural  net  computation  paradigm.  By  virtue  of  their  in- 
herent parallel  and  distributed  processing,  neural  nets  have  been  shown  to  be  able  to 
perform  tasks  that  are  extremely  difficult  for  conventional  von  Neumann  machines, 
but are easy for humans. These tasks include image recognition (Carpenter and
Grossberg,  1987)  and  speech  processing  (Sejnowski  and  Rosenberg,  1987).  More  im- 
portantly, neural  nets  have  been  successfully  applied  to  solve  problems  that  often 
require  human  experts,  such  as  sun  spot  prediction  (Weigend  et  al.,  1990)  and  ERP1 
recognition (DasGupta et al., 1990).

In  the  business  world,  neural  networks  have  been  successfully  applied  to  areas 
where  traditional  approaches  are  ineffective  or  inefficient  to  use.  A  partial  list  of 
such areas includes loan evaluation (Judge, 1989), signature recognition (Rochester,
1990),  stock  market  prediction  (Dutta  and  Shekhar,  1988),  time  series  forecasting 
(Sharda  and  Patil,  1990),  and  classification  analysis  (Fisher  and  McKusick,  1989; 
Singleton  and  Surkan,  1990). 

The  leading  neural  net  paradigm  for  applications  is  the  feedforward  neural  net 
(FNN).  An  FNN  is  used  by  first  training  it  with  known  examples.  Once  the  network 
is  trained  successfully,  or  in  other  words  the  neural  net  has  learned  the  concept/rule 
embedded  in  the  training  examples,  it  can  be  used  to  recognize  an  associated  out- 
come given  an  input  it  has  seen  before.  The  trained  neural  net  can  also  be  used  to 
estimate/predict  a  possible  outcome  when  a  novel  input  is  presented. 

A  neural  net  training  procedure  is  also  called  a  learning  algorithm.  One  of  the 
most  widely  (and  wildly)  used  neural  net  learning  algorithms  is  the  backpropagation 
(BP) procedure (le Cun, 1988; Rumelhart et al., 1986). Although BP has contributed
to many success stories, this learning procedure lacks a sound theoretic
background.  Backpropagation  is  essentially  a  simple  gradient  descent  based  search 
algorithm.  Consequently,  it  has  the  well  recognized  deficiencies  of  simple  gradient 
descent. The learning process is usually very slow, and the final solution is likely to
be a local minimum solution if the training problem has multiple minima, which is
often true. Furthermore, the BP algorithm as used in practice deviates from strict
gradient descent. This deviation may reduce the likelihood of the solution being trapped in
an unsatisfactory local minimum. However, the convergence of the procedure has now
become an open question in theory.

1 Event Related Potential (ERP) is a measure of brain response to sensory stimuli. ERP has been
related to human performance in a given environment (see DasGupta et al., 1990).

Other  shortcomings  of  the  backpropagation  algorithm  include  a  static  (fixed  a 
priori)  neural  network  structure,  ad  hoc  choice  of  learning  parameters,  and  sensitiv- 
ity to  initial  conditions  (weight  values).  Because  of  these  limitations,  feedforward 
neural  nets  trained  with  the  BP  algorithm  reach  only  a  suboptimal  status.  The  gen- 
eralization ability  of  the  neural  nets,  an  ability  to  function  in  a  domain  larger  than 
the  training  set,  is  also  limited. 
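The local-minimum and initial-condition issues described above can be illustrated with a minimal sketch. The one-dimensional criterion `loss` below is a hypothetical function chosen only because it has two minima; it is not a neural network error surface and does not come from this dissertation. Plain gradient descent is run from two different starting weights.

```python
def loss(w):
    # Hypothetical criterion with two minima (an assumption for illustration).
    return w**4 - 3 * w**2 + w


def grad(w):
    # Analytic derivative of the hypothetical criterion above.
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w0, rate=0.01, steps=2000):
    """Plain gradient descent with a fixed learning rate, starting at w0."""
    w = w0
    for _ in range(steps):
        w -= rate * grad(w)
    return w


if __name__ == "__main__":
    for w0 in (-2.0, 2.0):
        w = gradient_descent(w0)
        print(f"start {w0:+.1f} -> w = {w:.4f}, loss = {loss(w):.4f}")
```

Under these assumptions, the two starting points settle in different minima with different criterion values, so the quality of the final solution depends on the initial weight.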

Extensive  research  has  been  carried  out  in  recent  years  to  explore  the  poten- 
tial of  feedforward  neural  nets  and  to  improve  the  effectiveness  and  efficiency  of  the 
backpropagation learning procedure (Jacobs, 1988; Becker and le Cun, 1988; Moller,
1990).  Remarkable  progress  has  been  made  in  developing  new  training  methods  and 
neural  network  architectures  (Fahlman,  1989;  Fahlman  and  Lebiere,  1990;  Chan  and 
Shatin,  1990).  However,  most  variations  of  the  backpropagation  algorithm  are  based 
on  heuristics  that  reduce  the  generality  of  the  approach.  For  example,  Fahlman's 
cascade  correlation  algorithm  is  orders  of  magnitude  faster  than  the  classic  back- 
propagation  algorithm,  but  its  application  is  limited  to  input-output  mappings  with 
binary  outputs.  Much  less  work  has  been  done  in  overcoming  the  problem  of  local 
minima. A few researchers have used stochastic global search methods in neural net
training  with  moderate  success  (Montana  and  Davis,  1989;  Fang  and  Li,  1991).  To 
date,  we  have  seen  no  reports  that  apply  deterministic  global  optimization  approaches 
to  neural  net  training. 

Compared  with  a  vast  volume  of  applications,  theoretic  study  on  neural  nets 
(in  particular,  the  backpropagation  learning  algorithm)  has  been  weak  at  best.  As  a 
result  of  the  lack  of  theoretic  guidelines,  reports  on  the  applicability  and  performance 
of  neural  net  training  algorithms  are  often  inconsistent.  What  adds  to  the  confusion 
of the area is the lack of coherent terminology. Concepts such as convergence
and generalization are constantly referred to without precise definitions. There is
apparently  a  need  for  unified  definitions  and  formalism  of  the  FNN  learning  paradigm. 
In  this  dissertation,  we  attempt  to  fill  this  need  and  address  the  problems  associ- 
ated with  backpropagation  learning,  with  a  focus  on  developing  efficient  and  globally 
convergent  learning  algorithms.  Our  approaches  involve  both  stochastic  and  deter- 
ministic global  optimization  techniques.  We  propose  to  treat  neural  network  training 
as a global optimization problem. Recent developments in global optimization research
lend us some viable tools, such as the branch-and-bound method and Lipschitz
optimization  (Horst  and  Tuy,  1990).  We  also  consider  globally  guided  heuristic  search 
methods. 

The dissertation is composed of nine chapters. Following the introduction, Chapter 2
presents a general account of neural networks, an outline of the historical development
of neural net research, and a more detailed discussion of the promise and
problems  of  current  neural  network  study.  Chapter  3  gives  the  basic  concepts  and 
definitions  of  feedforward  neural  nets.  The  backpropagation  algorithm  is  derived  and 
discussed  in  detail  regarding  its  learning  mechanism,  the  applicability  and  limitations, 
and  implementation.  The  next  chapter  (Chapter  4)  focuses  on  the  improvement  of 
the  backpropagation  learning  algorithm.  A  variety  of  approaches  is  presented,  rang- 
ing from  using  efficient  optimization  procedures,  to  designing  new  network  structures, 
to  dynamically  adapting  learning  parameters  and  learning  mechanisms.  This  chapter 
summarizes  the  state-of-the-art  research  in  feedforward  neural  network  training. 

Chapter  5  begins  our  work  on  globally  convergent  neural  network  learning  pro- 
cedures. We  develop  a  search  method  that  uses  the  information  in  neural  network 
output  space  to  guide  the  learning  process,  rather  than  search  in  the  complicated 
weight  space  following  the  gradient  descent.  We  explore  the  application  of  stochastic 
global  optimization  methods  in  neural  network  training  in  Chapter  6.  In  particular, 
we  discuss  the  use  of  genetic  algorithms,  simulated  annealing,  pure  random  search 
and  clustering  random  search  methods. 

Chapter  7  deals  with  neural  net  training  with  deterministic  global  optimization 
approaches.  We  concentrate  on  the  application  of  the  branch-and-bound  framework 
for global optimization. Lipschitz continuity of the global criterion function is used
in obtaining lower bounds of the branch-and-bound procedure, through an extension
of  the  univariate  Piyavskii  algorithm.  Upper  bounds  can  be  obtained  with  or  without 
local  search  in  the  partition  elements.  A  procedure  is  developed  to  compute  local 
Lipschitz constants over subsets of the weight space. This leads to tighter lower bounds
and  more  effective  pruning  in  the  branching  search  process. 

The  implementation  of  the  global  optimization  training  algorithm  (GOTA)  is  dis- 
cussed in  Chapter  8.  We  show  that  the  computation  of  the  local  Lipschitz  constant  is 
easily  carried  out  by  exploring  the  special  structure  of  the  feedforward  neural  network 
and  the  property  of  the  sigmoid  activation  function.  We  also  discuss  the  simulation 
program  design  and  different  search  strategies  under  the  general  framework  of  GOTA. 
Experiments  on  the  effectiveness  of  GOTA  and  its  local  search  augmented  version 
(LGOTA)  are  carried  out  with  some  standard  benchmark  problems. 

Finally,  in  Chapter  9,  we  summarize  our  contribution  in  the  dissertation.  Several 
conclusions  are  reached  based  on  our  theoretical  study  and  experimental  investiga- 
tion. Further  extensions  of  this  research  are  also  discussed. 


CHAPTER  2 
THE  RENAISSANCE  OF  NEURAL  NETWORKS 

Let's  face  it,  beyond  part  of  the  interest  in  connectionism  is  that  dirty 
little  secret  that  researchers  in  nuclear  physics  had  during  the  thirties— 
that  maybe  you  can  build  something  with  it. 

—  Gary  Lynch1 

After  more  than  a  decade  of  dormancy,  research  in  artificial  neural  networks  came 
back  to  life  in  the  80's  and  experienced  an  explosive  growth  in  recent  years.  The  new 
surge of enthusiasm resembles the initial excitement in neural nets in the late 50's and
the  early  60's,  only  far  more  intensive  and  extensive.  The  wave  of  neural  network 
research  has  engulfed  widespread  disciplines:  neuroscience,  psychology,  linguistics, 
computer  science,  engineering,  mathematics,  and  decision  sciences.  In  fact,  the  ma- 
jority of  neural  net  research  has  gone  so  far  as  to  have  totally  lost  any  traces  to  their 
biological  roots.  Thus  when  we  quote  Gary  Lynch,  a  well-known  neuroscientist,  we 
do not really mean that we are going to build an electronic brain; rather, we mean
to  build  "something"  that  will  enable  us  to  solve  problems  that  are  intractable  or 
difficult  to  solve  with  conventional  approaches. 

2.1      Overview  of  Neural  Networks 

As  a  reflection  of  the  relative  youth  and  broad  scope  of  this  field,  neural  networks 
are  known  by  various  names  such  as  adaptive  systems,  connectionist  machines,  neu- 
rocomputers,  collective  decision  circuits,  parallel  distributed  processors  and  neuro- 
morphic  systems  (Lippmann,  1987;  Knight,  1990).  There  are  as  many,  if  not  more, 
varied and sometimes esoteric definitions, from simple ones such as: A model "composed
of many nonlinear computational elements operating in parallel and arranged
in patterns reminiscent of biological neural nets" (Lippmann, 1987, p. 4) to more
complicated and specific ones such as:

A parallel, distributed information processing structure consisting of processing
elements (which can possess a local memory and can carry out
localized information processing operations) interconnected with unidirectional
signal channels called connections, each processing element of
which has a single output connection which branches out into as many
collateral connections as desired with each carrying the same signal, that
being of any mathematical type desired (the processing being local to the
processing element, i.e., dependent only on the current values stored in
the processing element's local memory). (Hecht-Nielsen, 1989, p. 593)

1 Adopted from Allman (1989).

By  the  very  fact  that  research  in  neural  networks  was  only  revived  recently  and  has 
found  its  way  into  such  a  diversified  spectrum  of  disciplines,  it  is  hard  to  give  neural 
nets  a  generic  and  concise  definition.  But  it  is  generally  agreed  that  the  essence  of 
neural  nets  is  parallel  distributed  processing  (PDP)  (Rumelhart,  McClelland  and  the 
PDP  Group,  1986). 

Originally,  neural  nets  were  biologically  motivated.    However,  research  in  this 
field  has  long  (well,  relatively  long)  diverged  into  two  directions.    One  branch  en- 
deavors to  understand  our  very  brain.  Researchers  in  this  branch  are  concerned  with 
human  perception,  memory,  reasoning,  and  learning.  The  other  branch  is  more  inter- 
ested in  the  computational  models  and  the  power  to  accomplish  traditionally  difficult 
tasks,  rather  than  biological  fidelity.  The  main  thrust  of  current  neural  net  research 
seems  tilted  towards  the  second  area.    There  are  more  than  a  dozen  main  neural 
net  paradigms  being  actively  applied  today  (Simpson,  1990).  The  most  widely  used 
one  is  the  backpropagation  (BP)  model  (le  Cun,  1988).  BP  bears  little  resemblance 
to  biological  systems.  The  popularity  of  BP  arises  from  its  simplicity  and  powerful 
representation  ability  that  can  address  a  wide  variety  of  real  world  problems.  Other 
neural  net  models  that  do  not  have  much  biological  flavor  but  find  successful  ap- 
plications in  pattern  recognition,  decision  making  and  optimization  include  Hopfield 
networks  (Hopfield,  1982)  and  Kohonen's  self-organizing  networks  (Kohonen,  1989). 
Although neural network models vary just as much as the tasks they model, they
share some general features under the generic name neural network. They
are  characterized  by  the  following: 

1. A set of simple processing units (neurons)

2. Massive inter-neuron connections and associations via those connections

3. Highly parallel processing

4. Internal information representation and distributed storage (as weights on the
connections and/or the activation states of the neurons)

5. A learning rule whereby the internal representation is changed in response to
the changes in the environment

6. A learning environment that provides input and feedback to the network

The  basic  characteristics  of  an  artificial  neural  network  are  similar  to  its  biological 
counterpart.  But  for  most  neural  network  paradigms,  the  learning  mechanisms  do 
not  even  remotely  resemble  the  learning  mechanism  in  biological  systems.  Neverthe- 
less, neural  networks  provide  a  framework  within  which  certain  aspects  of  the  human 
brain  can  be  modeled.  Those  aspects  include  association,  classification,  generaliza- 
tion, optimization  (under  soft  constraints)  and  adaptation.  In  large  part,  intelligent 
systems  (artificial  or  natural)  depend  on  those  abilities,  and  those  abilities  are  not 
easily  modeled  with  conventional  serial  processing  models  based  on  von  Neumann 
machines. The structural and nonprogramming approach of neural networks lends
itself to dealing with difficult artificial intelligence (AI) problems such as pattern
recognition  problems.  While  it  is  often  difficult  or  impossible  to  explicitly  write  down 
a  set  of  rules  for  such  problems  (hence  symbolic  approaches  fail),  neural  networks  can 
learn  from  training  data  to  produce  a  solution.  In  recent  years  neural  networks  have 
made  strong  advances  in  AI  areas  (Caudill,  1989). 

Conventional expert system inferences slow down with an increase in their knowledge
base. This is counterintuitive: humans get faster as they acquire more knowledge
about  the  problem  domain.  This  deficiency  in  expert  systems  is  due  to  the  se- 
quential search  nature  of  the  inference  mechanism.  This  problem  is  alleviated  with 
neural  networks.  Parallel  and  distributed  processing  enables  neural  networks  to  pro- 
cess/retrieve large  amounts  of  information  at  high  speed.  Parallel  processing  leads 
to  interaction  and  distribution  leads  to  association.  Distributed  storage  is  also  called 
content-addressable  memory  (Beale,  1990),  where  an  entire  (complex)  pattern  can 
be retrieved by using any part of it as a key (Rumelhart, McClelland and the PDP
Group,  1986). 

The  neural  network  paradigm  makes  itself  easily  adaptive.  This  ability  is  essential 
in  a  dynamic  environment.  Some  neural  network  models  have  been  shown  to  be  equiv- 
alent to  statistical  classifiers  (White,  1989).  Compared  with  statistical  approaches, 
neural  networks  have  the  advantages  of  robustness,  by  virtue  of  their  distributed 
representation and adaptation. Also, neural networks make few or no assumptions
concerning  the  underlying  distribution  of  the  training  data.  They  may  be  applied  to 
data  sets  generated  by  non-Gaussian  processes  where  traditional  statistical  methods 
cease  to  be  effective  (Lippmann,  1987). 

In  a  distributed  processing  system,  the  job  is  done  by  the  joint  effort  of  many 
processing  units.  If  one  or  a  few  of  those  units  fail,  they  do  not  significantly  affect 
the  performance  of  other  processing  units  and  the  system  as  a  whole  still  works.  This 
property  is  known  as  fault  tolerance,  which  is  not  shared  by  traditional  computing 
paradigms.  The  human  brain  presents  an  excellent  example  of  fault  tolerance  where 
some  neurons  die  out  daily  and  the  brain  keeps  functioning  in  every  practical  sense. 
On  the  contrary,  a  serial  processing  machine  comes  to  a  complete  halt  with  a  failure 
in  virtually  any  part  of  it.  Even  with  continued  damage  to  the  processing  units,  a 
distributed system has "graceful degradation": that is, the system's performance
deteriorates  gradually,  rather  than  with  a  catastrophic  breakdown. 

2.2     Historical  Development 

The  study  of  neural  networks  has  a  long  and  colorful  history.  Pioneering  work 
on  neural  nets  dates  back  to  the  early  1940s  when  McCulloch  and  Pitts  (1943/1988) 
proposed  that  the  brain,  as  a  computing  device,  consists  of  simple  processing  units 
(neurons).  They  built  a  simple,  yet  elegant,  model  of  a  neuron  (later  known  as  a 
McCulloch-Pitts  neuron  or  simply  an  M-P  neuron)  in  which  a  well-defined  process 
determines  the  activation  levels  of  the  neuron  based  on  the  stimulus  it  receives  from 
its  environment.  The  M-P  neurons  have  powerful  capability  since  it  can  be  shown 
that  an  arbitrary  Boolean  function  can  be  implemented  with  them  (Murphy,  1990). 
The basic structure and operations of the M-P neuron can still be found in some of
today's  neural  network  models. 
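The claim that M-P neurons can implement Boolean functions can be sketched with a simple threshold unit. The code below is an illustrative simplification, not the original 1943 formulation: inhibition is modeled here as a negative weight (an assumption), and the particular weights and thresholds are chosen by hand.

```python
def mp_neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def AND(a, b):
    return mp_neuron([a, b], [1, 1], threshold=2)


def OR(a, b):
    return mp_neuron([a, b], [1, 1], threshold=1)


def NOT(a):
    # Inhibitory input modeled as a negative weight (a simplifying assumption).
    return mp_neuron([a], [-1], threshold=0)


def XOR(a, b):
    # Needs more than one unit: (a OR b) AND NOT (a AND b).
    return AND(OR(a, b), NOT(AND(a, b)))


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, AND(a, b), OR(a, b), XOR(a, b))
```

Since AND, OR, and NOT suffice to express any Boolean function, networks of such units can in principle realize arbitrary Boolean functions; note that XOR here requires a composition of several units rather than a single one.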

The  M-P  neurons  provide  a  model  of  computation  that  enables  the  idea  of  con- 
nectionism.  The  activations  of  the  neurons  are  determined  by  the  combined  effects  of 
incoming  excitatory  and  inhibitory  stimuli.  But  nothing  was  known  about  how  the 
connection  strength  between  neurons  could  be  changed  to  adapt  to  a  new  environment 
until  Donald  Hebb  (1949/1988)  made  known  in  his  Organization  of  Behavior  the  first 
neural  network  learning  rule,  which  has  come  to  be  known  as  the  Hebbian  learning 
rule.  The  essence  of  the  Hebbian  learning  rule  states  that  the  synapse  (weight)  be- 
tween two  neurons  should  be  strengthened  if  both  neurons  fire  (in  active  states),  and 
the  synapse  should  be  weakened  if  only  one  of  them  fires.  The  Hebbian  learning  rule 
was  proposed  without  rigorous  mathematical  derivation,  but  it  has  been  regarded 
as  a  foundation  of  many  more  sophisticated  learning  rules.  Its  generic  nature  and 
its ability to capture the learning behavior in biological systems (Caudill, 1989) have
contributed  to  its  continued  utilization. 
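The Hebbian rule as stated above can be written as a small weight-update function. This is a literal sketch of the verbal rule; the fixed step size `rate` and the choice to leave the weight unchanged when neither neuron fires are assumptions, since the description does not specify them.

```python
def hebbian_update(weight, pre_fires, post_fires, rate=0.1):
    """Return the updated connection weight between two binary neurons.

    pre_fires and post_fires are 0 or 1 (inactive or firing).
    """
    if pre_fires and post_fires:
        return weight + rate  # both fire: strengthen the synapse
    if pre_fires != post_fires:
        return weight - rate  # only one fires: weaken the synapse
    return weight             # neither fires: unchanged (an assumption)


if __name__ == "__main__":
    w = 0.5
    for pre, post in [(1, 1), (1, 0), (0, 0)]:
        w = hebbian_update(w, pre, post)
        print(pre, post, round(w, 3))
```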

A  milestone  in  neural  network  history  was  the  introduction  of  the  perceptron  by 
Frank  Rosenblatt  (1962).  A  perceptron  is  a  single  M-P  neuron  or  a  set  of  M-P  neu- 
rons that  systematically  adjusts  its  (their)  weights  and  excitatory  thresholds  to  learn 
a  given  input-output  association.  The  perceptron  learning  rule  is  an  adapted,  sys- 
temized  Hebbian  rule.  In  Principles  of  Neurodynamics,  Rosenblatt  (1962)  proved  the 
perceptron  convergence  theorem.  This  theorem  shows  that  a  perceptron  can  learn  in 
finite  time  any  pattern  association  that  is  linearly  separable.  The  perceptron  conver- 
gence theorem  was  powerful  enough  to  stimulate  widespread  interest  in  perceptron 
learning.  There  was  much  speculation  about  how  intelligence  could  arise  from  such 
neuron-like  devices. 
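A minimal sketch of perceptron learning is given below to illustrate the convergence behavior described above. It uses the standard error-correction form of the rule with a hard-limit output; the function names, the learning rate, and the epoch limit are assumptions for illustration and are not taken from Rosenblatt's text. On a linearly separable association (logical AND) training stops once an epoch produces no errors, while on XOR no error-free epoch is ever reached.

```python
def predict(weights, bias, x):
    """Hard-limit output of a single perceptron unit."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, rate=1.0, max_epochs=100):
    """Train on (input, target) pairs.

    Returns (weights, bias, epochs_used); epochs_used is None if no
    error-free epoch occurred within max_epochs.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, target in samples:
            delta = target - predict(weights, bias, x)
            if delta != 0:
                # Move the decision boundary toward the misclassified example.
                weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
                bias += rate * delta
                errors += 1
        if errors == 0:
            return weights, bias, epoch
    return weights, bias, None


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

if __name__ == "__main__":
    print("AND:", train_perceptron(AND_DATA))
    print("XOR:", train_perceptron(XOR_DATA))
```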

The  limitation  of  perceptrons  to  binary  outputs  was  removed  by  Widrow  and 
Hoff  (1960).  They  replaced  the  hard-limit  activation  function  in  perceptrons  with  a 
semilinear  activation  function.  Their  model  was  named  Adaline,  for  adaptive  linear 
neurons.  Adaline  was  later  expanded  to  Madaline,  for  multiple  Adalines  (Widrow 
and  Stearns,  1985),  which  are  applied  to  learning  association  with  multiple  output 
classes. Adaline and Madaline were also proved to be convergent to any function
they  could  represent  (Wasserman,  1989). 

The enthusiasm for perceptrons dwindled when researchers in the area found
that perceptrons failed to live up to their expectations. The publication of the book
Perceptrons by Minsky and Papert (1969) initiated a dark age for neural network
research. The authors performed a rigorous mathematical analysis of the capability
and limitations of the perceptron. They showed that the class of problems that can
be effectively solved by perceptrons is limited to linearly separable problems. Indeed,
perceptrons fail to solve such simple problems as the Exclusive-Or (XOR) problem.
(More detailed discussion on the XOR problem is presented in Chapter 3.) With
linear activation functions, a multilayered perceptron is equivalent to a single-layer
perceptron. So multilayer perceptrons with linear activations could do no better than
solving linearly separable problems. For multilayer perceptrons with a nonlinear
activation function, there still did not exist an effective training algorithm. This
seemingly insurmountable difficulty in training multilayered perceptrons led to the
following inconclusive conclusion of Minsky and Papert (1969, p. 231). They wrote:

The  perceptron  .  .  .  has  many  features  that  attract  attention.  Its 
linearity;  its  intriguing  learning  theorem;  its  clear  paradigmatic  simplicity 
as  a  kind  of  parallel  computation.  There  is  no  reason  to  suppose  that  any 
of  these  virtues  carry  over  to  the  many-layered  version.  Nevertheless,  we 
consider  it  to  be  an  important  research  problem  to  elucidate  (or  reject) 
our  intuitive  judgment  that  the  extension  is  sterile. 

Despite  Minsky  and  Papert's  recognition  of  the  importance  of  multilayered  percep- 
trons, their  pessimism,  backed  up  with  their  reputation  and  the  rigor  of  their  work, 
effectively  turned  mainstream  research  away  from  neural  networks. 

Nevertheless, research in neural networks did not completely die out. With dedicated
effort, a small group of researchers continued their work in this largely abandoned
field. Some important progress made during the "post perceptron era" (the
70's) includes Grossberg's adaptive resonance theory (ART), Anderson's associative
memory, and Kohonen's self-organizing network (Anderson and Rosenfeld, 1988).

Hopfield's work on a particular recurrent network, the Hopfield network, marked
a  turning  point  in  neural  network  history.  A  distinguished  physicist,  he  is  credited 
with  reviving  public  interest  in  neural  network  models.  Hopfield  showed  for  the  first 
time that a fully connected recurrent network exhibits emergent collective computational
capability (Hopfield, 1982), that is, that the local interactions among the
processing  units  can  produce  global  behaviors.  His  model  was  later  expanded  to 
allow  neurons  to  have  continuous  values  (Hopfield,  1984)  and  be  applied  to  hard 
optimization  problems  (Hopfield  and  Tank,  1985). 
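How local interactions in a fully connected recurrent network can produce a global behavior such as pattern recall can be sketched as follows. This is a small illustration in the spirit of the binary Hopfield model, not a reproduction of Hopfield's experiments: the outer-product storage rule, the sequential update order, and the two 8-unit patterns are assumptions chosen for brevity.

```python
def train_hopfield(patterns):
    """Build a symmetric weight matrix from +1/-1 patterns (outer-product rule)."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j]
    return weights


def recall(weights, state, max_sweeps=20):
    """Update units one at a time until no unit changes its state."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            # Each unit looks only at the signals arriving on its connections.
            field = sum(weights[i][j] * state[j] for j in range(n))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


PATTERN_A = [1, 1, 1, 1, -1, -1, -1, -1]
PATTERN_B = [1, -1, 1, -1, 1, -1, 1, -1]

if __name__ == "__main__":
    w = train_hopfield([PATTERN_A, PATTERN_B])
    noisy = list(PATTERN_A)
    noisy[0] = -noisy[0]  # corrupt one unit
    print(recall(w, noisy) == PATTERN_A)
```

Starting from a corrupted copy of a stored pattern, the network settles back to that pattern, which is also a simple instance of the content-addressable memory idea discussed in Section 2.1.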

The  new  era  of  neural  network  study  witnessed  a  resurgence  with  the  publication 
of the three-volume Parallel Distributed Processing by Rumelhart, McClelland and
the  PDP  Research  Group  in  1986.  By  then,  some  theoretical  background  had  been 
established,  and  there  had  been  breakthroughs  in  the  neurobiological  understanding 
and  computer  capabilities  (which  made  it  feasible  to  develop  and  test  more  sophisti- 
cated models).  The  PDP  books  were  well  publicized  and  stimulated  a  new  fever  of 
neural  net  research  that  more  than  rivaled  that  which  had  occurred  in  the  early  60's. 
Of  particular  importance  is  the  backpropagation  (BP)  learning  algorithm  developed 
by Rumelhart, Hinton, and Williams (1986).2 BP provides a procedure that successfully
solves the "credit assignment" problem3 in multilayered perceptron training,
and  hence  provides  a  rebuttal  to  Minsky  and  Papert's  conjecture  that  research  in 
multilayered  perceptrons  would  be  futile.  Indeed,  Rumelhart,  Hinton,  and  Williams 
(1986)  showed  that  multilayered  networks  with  BP  learning  were  able  to  solve  a  wide 
variety of nonlinear classification problems, including the notorious XOR problem.
Backpropagation has become the backbone of current neural network research.

2.3     Neural  Network  Applications 

The  continued  and  ever-increasing  interest  in  neural  net  study  has  been  both  a 
consequence  of  and  a  driving  force  for  successful  applications.  In  many  areas  neural 
nets  offer  a  different  (drastically,  sometimes)  method  of  approaching  a  problem,  and 
open  new  avenues  to  attack  traditionally  intractable  tasks  or  to  solve  more  efficiently 
problems that are being solved with traditional methods. In the following we will
survey the applications of neural nets in artificial intelligence (AI), decision sciences,
business, and engineering, while largely omitting the bulk of research in cognitive
science, psychology, and neuroscience.

2It was later found that Parker (1985) had derived the algorithm and called it "learning logic";
Werbos (1974) had derived the algorithm in his Ph.D. dissertation at Harvard University and called
it "dynamic feedback"; le Cun (1988) discovered that Bryson and Ho (1969) had described the
algorithm in the context of optimal control; White (1989) showed that the backpropagation algorithm
can be viewed as an application of Robbins and Monro's (1951) method of "stochastic approximation."
3The problem is to find which intermediate step of a process affects the final results and how it
does so, and to modify that step to improve the final results.

2.3.1     Neural Networks in AI

Traditional AI, as a rival of neural networks, was successful in the 70's. AI,
in  particular  expert  systems,  has  found  many  fruitful  applications.  Tasks  that  were 
regarded  as  requiring  high  intelligence,  such  as  chess  playing  and  theorem  proving, 
can  be  accomplished  by  expert  systems  with  remarkable  performance.  Traditional 
AI  approaches  are,  however,  inefficient  in  solving  pattern  recognition  problems,  such 
as  vision  and  speech  processing,  due  to  their  nature  of  symbolic  representation  and 
serial  processing.  Expert  system  development  has  been  hindered  by  the  notorious 
knowledge  acquisition  bottleneck.  For  one  thing,  experts  are  rare.  Perhaps  more  im- 
portantly, expert  knowledge  cannot  simply  be  put  down  as  a  set  of  precise  rules.  The 
parallel  distributed  processing  paradigm  of  neural  nets  seems  a  promising  alternative 
to  overcome  the  difficulties  in  AI. 

On  the  other  hand,  the  success  and  advantages  of  traditional  AI  approaches  are 
not  deniable.  One  noticeable  inroad  that  neural  nets  have  made  into  traditional  AI  is 
the  integration  of  the  two  seemingly  different  approaches.  Several  ways  of  integrating 
neural  nets  with  AI  systems  are  discussed  in  Caudill  (1990).  Lamberts  (1988)  built  a 
hybrid system where neural nets were used as a front-end processor that performs low
level  learning  while  an  expert  system  performs  high  level  reasoning.  The  inference 
attained  by  the  expert  system  from  processing  the  output  of  the  neural  nets  is  used 
as  a  guide  to  modify  the  neural  network  weights. 

Becker  and  Peng  (1987)  proposed  a  method  for  integrating  neural  nets  and  sym- 
bolic processing.  Gallant  (1988)  worked  on  the  problem  of  extracting  production 
rules  from  neural  nets,  using  a  limited  set  of  values  for  the  activation  functions.  The 
"connectionist expert system," as Gallant put it, was applied to diagnostic problems.
A  similar  model  was  developed  by  Yun  and  Reggia  (1989).  By  restricting  the  input 
domain, Bochereau and Bourgine (1990) were able to extract rules from a trained
multilayer neural network. Maskara and Neetzed (1990) used neural nets as an efficient
front-end for a rule-based system where the neural network was trained to learn
the  associations  of  the  expert  system  rules.  Similar  to  a  content  addressable  mem- 
ory, upon  receiving  partial  rule  descriptions,  the  neural  network  outputs  all  applicable 
rules. 

Neural  nets  appear  well  suited  for  fuzzy  learning.  Shiue  and  Grondin  (1987) 
developed a fuzzy-learning neural automaton. Hayashi and Nakai (1989, 1990) used
neural nets to generate fuzzy rules. Fuzzy production rules and their membership
functions can be implemented in structured neural nets (Yamaguchi et al., 1990).

In  the  mapping  of  rule-based  systems  to  neural  nets,  a  concept  (feature,  word, 
symbol,  variable,  fact,  predicate,  etc.)  may  be  represented  as  a  unit,  and  logic  rela- 
tions between  concepts  may  be  represented  by  the  connections  between  units.  The 
strengths (weights) of the connections then correspond to the degree of certainty of
the  logic  relations  (Tan  et  al.,  1990;  Yang  and  Bhargava,  1990).  Thus  learning  in 
neural  nets  can  be  regarded  as  modifying  the  certainty  of  the  rules.  Kuncicky  (1990) 
proposed  an  isomorphism  that  maps  from  not  only  rule-based  systems  to  neural  nets, 
but  also  from  neural  nets  to  a  rule-based  system.  The  number  and  structure  of  the 
rules  may  change  in  such  a  hybrid  system  as  a  result  of  neural  network  learning. 
Kerce  and  Mueller  (1990)  used  a  heuristic  link  neural  network  that  is  applied  to 
state  space  search.  A  feedforward  neural  network  is  employed  that  takes  the  state 
description  as  inputs,  and  its  output  is  used  as  a  guiding  heuristic  for  the  state  space 
search. 

Successful applications of neural nets in AI areas (such as control, vision, robotics,
speech,  and  game  playing)  are  numerous  (Wang  and  Yeh,  1990).  One  of  the  most 
influential  applications  is  the  NETtalk  by  Sejnowski  and  Rosenberg  (1986).  NETtalk 
is  a  simple  two-layer  feedforward  neural  network.  Given  a  series  of  examples  of 
English  text  and  the  correct  pronunciation,  NETtalk  was  able  to  learn  to  read  English 
aloud,  with  proper  speed  and  intonation. 

All  the  problems  mentioned  above  to  some  extent  require  automatic  learning. 
The learning in neural nets is nonalgorithmic. This is a form of "extensional programming"
(Cottrell et al., 1987) in which learning is achieved through repeated
exposure to examples that embed the target concept (to be learned). In contrast,
a conventional computer requires algorithmic approaches, or "intensional programming,"
where strict instructions or rules are followed with no reference to specific
examples. Extensional programming reduces the need for knowledge acquisition,
and  hence  represents  a  powerful  technique  (Knight,  1990). 

2.3.2     Neural  Networks  in  Decision  Sciences 

Neural  nets  provide  a  powerful  computational  framework  that  extends  its  appli- 
cation scope  far  beyond  traditional  AI  problems.  As  mentioned  above,  neural  nets 
can  be  integrated  with  expert  systems,  and  hence  provide  a  new  way  of  implement- 
ing decision  support  systems.  Under  certain  conditions,  neural  nets  are  equivalent 
to  Bayesian  classifiers.  This  opens  wide  possibilities  for  using  neural  nets  in  decision 
sciences.  The  inherent  properties  of  neural  nets  enable  them  to  do  more  than  just 
statistical  decision  analysis.  Weigend  (1990)  reported  neural  net  classifiers  that  have 
been  shown  to  outperform  statistical  methods.  Burke  (1991)  and  Burke  and  Ignizio 
(1992)  described  several  neural  network  systems  and  their  applications  in  decision 
making.  They  also  discussed  conditions  under  which  neural  nets  would  be  prefer- 
able to  conventional  procedures  and  gave  some  guidelines  for  using  neural  nets  in 
operations  research. 

Hornik, Stinchcombe and White (1989) and others (Hecht-Nielsen, 1989; Cybenko,
1989) have shown that multilayer feedforward neural nets are universal approximators.
Simple feedforward neural nets with as few as one hidden layer can
approximate  any  continuous  input-output  mapping  to  arbitrarily  specified  accuracy 
(the  number  of  hidden  units  may  have  to  go  up  to  infinity,  though).  This  result 
solved  theoretically  the  representation  issue  and  made  neural  nets  a  legitimate  tool 
for  function  approximation  with  numerous  applications  in  system  identification,  de- 
sign, control,  modeling  and  prediction  (Werbos,  1989). 

Neural  nets  have  also  made  headway  in  operations  research  and  management  sci- 
ence. Some  neural  network  models  have  been  used  to  solve  NP-hard  problems  (which 
are  a  class  of  problems  that  cannot  be  solved  by  any  known  deterministic  polynomial 
time algorithm). Hopfield pioneered the use of neural nets in optimization (Hopfield
and Tank, 1985). Besides Hopfield networks, other neural nets used in combinatorial
optimization include Boltzmann machines (Hinton and Sejnowski, 1986), Cauchy machines
(Jeong and Park, 1989), and self-organizing networks (Durbin and Willshaw,
1987;  Hueter,  1988). 

Ramanujam  and  Sadayappan  (1988)  showed  how  to  map  to  neural  networks  a 
number  of  combinatorial  optimization  problems,  including  the  traveling  salesman 
problem  (TSP),  the  graph  partition  problem,  the  vertex  covering  problem,  and  the 
maximum  clique  problem.  Compared  with  conventional  approaches,  they  reported 
that  neural  network  results  showed  promise.  Xu  and  Tsai  (1991)  did  extensive  exper- 
iments on  the  TSP.  One  of  their  neural-net-based  algorithms  matches  or  outperforms 
the best known heuristic, the Lin and Kernighan algorithm (Lin and Kernighan,
1973). Also, the neural-net-based algorithm was shown to scale up better than the
Lin  and  Kernighan  algorithm.  Foo  and  Takefuji  (1988a,b)  applied  a  stochastic  neural 
network  for  job-shop  scheduling.  A  deterministic  approach  was  also  used  by  Foo  and 
Takefuji  (1988c)  to  solve  the  same  problem  with  neural  network  implemented  integer 
linear  programming. 

A  relatively  new  advance  of  neural  nets  has  been  made  in  the  area  of  mathe- 
matical programming.  Maa  and  Shanblatt  (1989,  1990)  applied  neural  nets  to  linear 
programming  problems.    Kennedy  and  Chua  (1988)  used  neural  nets  for  nonlinear 
programming. Barbosa and de Carvalho (1990) applied neural nets in feasible direction
linear programming. An adaptive feedforward neural net was used in multiple
criteria  decision  making  (Zhen  and  Malakooti,  1990).  Other  applications  include  the 
shortest  path  (Helton,  1990),  routing  (Zhang  and  Thomopoulos,  1989),  the  knapsack 
problem (Li, Fang and Wilson, 1989), and task assignment (Tanaka et al., 1989).
Neural nets are rivalling traditional statistical analysis in classification (Pratt and
Kamm, 1991), principal components analysis (Baldi, 1989), regression (Orris and
Feeser, 1991), and forecasting (Sharda and Patil, 1990). Choukri et al. (1991) reported
that multilayer neural nets outperformed statistical discriminant analysis and
attributed the success of neural nets to their nonlinear activation functions. Weigend
et al. (1990) applied feedforward neural nets to forecasting with noisy real-world
data from sunspots and computational ecosystems. The neural network, trained
with past data, generated accurate predictions and consistently outperformed traditional
statistical methods such as the TAR (threshold autoregressive) model (Tong
et al., 1980). Compared with an established time series forecasting technique, the
Box-Jenkins method, neural nets have the advantages of automatic learning, better
performance for nonstationary series and long-term forecasting (Tang, de Almeida
and Fishwick, 1990).

With  the  abilities  of  model  identification,  generalization,  and  prediction,  neural 
nets  have  found  many  applications  in  business  and  engineering.  In  business,  neu- 
ral nets  have  been  successfully  applied  to  loan  evaluation  (Judge,  1989),  signature 
recognition  (Rochester,  1990),  stock  market  forecasting  (Dutta  and  Shekhar,  1988) 
and  other  classification  analysis  (Fisher  and  McKusick,  1989;  Singleton  and  Surkan, 
1990). 

In engineering, neural nets have been applied to hardware fault diagnosis (Tan
et al., 1990), power system state evaluation (Nishimura and Arai, 1990), wastewater
treatment systems (Krovvidy and Wee, 1990), and intelligent FMS (flexible manufacturing
system) scheduling (Rabelo, Alptekin and Kiran, 1990). The potential of
neural  nets  as  an  engineering  design  tool  is  still  being  explored.  New  applications 
are  emerging  in  a  variety  of  engineering  areas.  Wu  et  al.  (1990)  used  neural-net- 
based  systems  to  model  the  behavior  of  materials  and  obtained  promising  results. 
Neubauer  (1991)  applied  neural  networks  to  metal  processing.  Neural  nets  have  also 
been  used  in  structural  mechanics  computation,  transportation  and  other  engineering 
applications  (Sun  and  Fu,  1991;  Dagli  and  Lammers,  1989). 

2.4     Promise  and  Problems 

Unlike  the  hype  surrounding  neural  nets  30  years  ago,  today's  neural  net  research 
has aimed at solving real-world problems. Nearly all the big companies in the computer
industry (AT&T, IBM, Texas Instruments and others) are involved in the
development  of  neural  networks.  Specialized  commercial  connectionist  hardware  and 
software  with  increasing  speed  and  capacities  are  being  marketed  (Simpson,  1990). 
The neural net wave, which started around 1987 when the first International Conference
on Neural Networks (ICNN) drew over 2000 "neo neural net believers" to San
Diego, has retained its momentum with the participation of researchers from more
and  more  diversified  areas  and  pump-priming  funding  from  NSF,  NASA,  DARPA, 
and  other  major  sponsors.  Judging  from  their  success  in  the  past  few  years  and  the 
still  widening  and  deepening  scope,  we  may  conclude  that  neural  nets  indeed  hold 
great  promise. 

The current optimism about the future of neural nets is no less fantastic than that in the
early 60's. Neural nets, along with nuclear technology and superconductivity, have
been dubbed one of the greatest inventions in our modern society. Leon Cooper, a
Nobel laureate, commented (in IJCNN, 1990) that what neural nets would be for the
next century is what the computer is for today. Hecht-Nielsen (1986) went further,
saying:

.  .  .  It  is  clear  that  if  [neural  network  technology]  realizes  its  stated  goals, 
its  impact  on  human  society  will  be  profound.  ...  It  may  thus  come  to 
pass  that  we  are  now  living  at  the  boundary  between  two  great  epochs 
of  human  existence;  namely,  the  transition  from  Civilization  to  Ubility 
[a term coined by Hecht-Nielsen to describe the imaginary future noble
society].  It  has  been  10,000  years  since  the  last  such  transition  (from 
Culture  to  Civilization).  If  all  of  this  is  true,  we  are  most  fortunate  to  be 
alive  to  witness  and  participate  in  this  change. 

While  a  repeat  of  neural  network  history  in  the  late  60's  seems  unlikely,  we  need  to 
be  very  cautious  about  overly  optimistic  expectations.  None  of  those  startling  claims 
such  as  "brain-like  machines"  in  the  nontechnical  literature  has  really  been  realized. 
It  is  true  that  great  progress  has  been  made.  However,  the  field  is  far  from  mature. 
Current  research  in  neural  nets  faces  many  challenges  in  both  theoretical  study  and 
practical  implementation.  In  the  theoretical  aspect,  a  solid  general  foundation  has 
yet  to  be  established.  There  exist  more  than  a  dozen  different  neural  network  archi- 
tectures that  are  being  used  in  different  problem  domains.  Each  model  has  its  own 
theory  and  implementation  peculiarities.  Little  has  been  done  to  establish  a  com- 
mon ground  for  those  models,  although  Grossberg  at  Boston  University  is  reportedly 
attempting  a  theoretical  framework  that  would  explain  all  neural  behaviors  (Miller, 
1990,  p.  1-17). 

Most  neural  net  models  lack  the  ability  to  explain  how  a  decision  is  made.  They 
are generally perceived as "black boxes." More often than not, the inability to understand
what is going on inside those "black boxes" makes them less appealing
than they would be otherwise. Recent progress has shed some light into the "black
boxes"  (Fu,  1991),  but  the  overall  picture  is  still  obscure. 

The leading neural network model, the multilayered feedforward neural network
with backpropagation, suffers from the same obscurity. BP has been widely used in many
applications, often with encouraging results. The theory behind BP is, however, far
from soundly established. BP is a simple and elegant procedure that overcomes the
difficulty of "credit assignment." But this procedure has some fundamental limitations,
as listed below:4

1.  Learning  (training)  is  generally  slow. 

2.  No convergence results have been established for pattern training, the most
commonly used training procedure.

3.  Convergence  of  epoch  training  to  a  local  minimum  is  achieved,  but  a  strictly 
local  minimum  may  not  represent  a  desired  solution. 

4.  The parameters, namely, the learning rate $\eta$ and the momentum $\alpha$, need to be
set empirically.

5.  The  structure  of  the  network  (number  of  layers  and  units)  is  determined  arbi- 
trarily. 

6.  The model offers the flexibility of choosing training schemes (epoch or pattern)
and different global criterion functions and neuron activation functions, but no
general guidelines exist.
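
To make items 1 and 4 of the list concrete, the following is a minimal Python sketch of a gradient descent weight update with a learning rate and a momentum term. It is an illustration under stated assumptions, not the procedure analyzed in this dissertation: the names `momentum_step`, `toy_gradient`, and `train`, the quadratic stand-in for the network error, and the parameter values are all hypothetical.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One update: delta = -eta * grad + alpha * prev_delta; w <- w + delta."""
    delta = [-eta * g + alpha * d for g, d in zip(grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta


def toy_gradient(w):
    # Gradient of the stand-in error E(w) = 0.5 * sum(w_i ** 2).
    # A real network would obtain this gradient by backpropagation.
    return list(w)


def train(w0, steps, eta, alpha):
    """Repeat the momentum update, starting with a zero previous step."""
    w, delta = list(w0), [0.0] * len(w0)
    for _ in range(steps):
        w, delta = momentum_step(w, toy_gradient(w), delta, eta, alpha)
    return w
```

Both `eta` and `alpha` have to be supplied by the user, which is exactly the empirical tuning that item 4 refers to; a poor choice makes the iteration slow or unstable.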

Extensive work has been done in the last few years to explore BP's potential and overcome
its limitations. A great research effort has been devoted to overcoming the first
problem mentioned above. A number of local acceleration heuristics are discussed in
Jacobs (1988). Other approaches to improve the speed of convergence include the
use of second order information of the error surface such as Newton's method and
conjugate gradient methods (Moller, 1990; Becker and le Cun, 1988; Johansson et
al., 1990). Those improvements on backpropagation often increase the learning speed
significantly in terms of training epochs at the cost of an increased computational
effort.

4Technical terms will be defined precisely in the next chapter.

Few  researchers  have  considered  the  second  and  third  problems  of  BP.  It  has 
been  reported  that  BP  with  pattern  training  works  better  than  epoch  training  for 
a large training sample. But no thorough theoretical account of this phenomenon has been
given. Most people choose to use epoch training or pattern training
arbitrarily.  This  leads  to  potentially  erroneous  conclusions  about  the  efficacy  of  the 
algorithm. 

For  the  global  convergence  problem,  empirical  results  have  shown  that  with  am- 
ple hidden  units  embedded  in  the  network,  BP  can  usually  escape  a  local  minimum 
(Rumelhart  et  al.,  1986)  probably  due  to  large  degrees  of  freedom.  However,  increas- 
ing hidden  units  in  the  network  may  not  be  an  appealing  idea,  since  an  unnecessarily 
large  number  of  hidden  units  is  likely  to  decrease  the  generalization  capability  of  the 
network (Kruschke and Movellan, 1989; Baum and Haussler, 1989) and may cause
overfitting  problems  (Weigend  et  al.,  1990).  Fang  and  Li  (1991)  have  adapted  simu- 
lated annealing  methods  to  neural  network  training.  Their  approach  guarantees  the 
solution  will  be  globally  optimal,  if  a  proper  annealing  schedule  is  derived  for  the 
given  problem.  Montana  and  Davis  (1989)  and  Belew  et  al.  (1990)  used  genetic 
algorithms  to  train  the  feedforward  neural  nets.  The  drawback  of  these  approaches 
is  that  they  involve  a  random  search  (sometimes  blindly)  and,  hence,  are  not  efficient 
in  general. 

In  the  interest  of  efficiency  and  generalization,  the  complexity  of  a  neural  network 
should  be  kept  to  its  bare  minimum.  Some  researchers  (Teh  and  Yu,  1988;  Sietsma 
and  Dow,  1988)  developed  heuristic  rules  for  pruning  away  inessential  hidden  units 
during  training,  starting  with  an  oversized  network.  Others  (e.g.,  Tenorio  and  Lee, 
1989)  used  dynamic  procedures  that  generate  new  units  as  needed.  In  those  ap- 
proaches a  trade-off  usually  has  to  be  made  between  the  estimation  accuracy  and 
the neural network (structural) complexity. Weigend et al. (1990) proposed to add
a penalty term to the global criterion function to discourage the use of excessively
large networks. This method has been used in Chauvin (1990) and others. One of the
drawbacks of this approach is that training time increases noticeably.
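
A penalized criterion of this kind can be sketched as follows. The specific penalty below, in which each weight contributes (w/w0)^2 / (1 + (w/w0)^2), is assumed here for illustration (it is the form commonly associated with weight elimination) and is not quoted from the cited papers; the scale `w0`, the multiplier `lam`, and the function names are hypothetical.

```python
def weight_elimination_penalty(weights, w0=1.0):
    """Sum of (w/w0)^2 / (1 + (w/w0)^2) over all weights.

    Each term is near 0 for |w| << w0 and approaches 1 for |w| >> w0,
    so the sum roughly counts the weights that are still in use.
    """
    return sum((w / w0) ** 2 / (1.0 + (w / w0) ** 2) for w in weights)


def penalized_error(sq_error, weights, lam=0.01, w0=1.0):
    """Global criterion: the data error plus lam times the complexity penalty."""
    return sq_error + lam * weight_elimination_penalty(weights, w0)
```

Minimizing `penalized_error` instead of the plain error trades fitting accuracy against network complexity; the extra term also has to be differentiated at every update, which is one reason training time increases.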

The deficiencies of neural nets, in particular of backpropagation, indicate that
much theoretical work needs to be done before we can fully explore the potential of
this emerging computation framework. We are not sure whether or when a profound
common theoretical basis for all neural network paradigms will emerge. But what we
can do now is to conduct a rigorous, systematic study of the major neural net models,
study their efficacy and efficiency, identify the conditions under which they
may  be  effectively  applied,  explore  the  theoretical  capabilities  and  limitations,  and 
build  new  and  improved  procedures  based  on  the  theoretical  guidelines.  By  doing  so 
we  can  hope  to  better  understand  this  new  field  and  its  future  and  proceed  gradually 
to  realize  its  potential  to  the  fullest  extent. 


CHAPTER 3
FEEDFORWARD  NEURAL  NETWORKS 

Feedforward neural nets (FNN) are the most popular neural network paradigm
in the computation modeling branch of neural net research. The principal learning
algorithm for training FNN is the backpropagation (BP) algorithm. The popularity of
BP arises from its simplicity and successful applications to many real-world problems.
This chapter will discuss the development of the backpropagation learning algorithm.
The efficacy and limitations of the BP algorithm will be analyzed, while improvements
to the classic algorithm will be presented in the next chapter. We will give basic
definitions and present theorems about the representation capability of general FNN.
We start with the building block of a neural network, the neuron, and then the
first workable neural network, the perceptron. Feedforward neural nets are built
upon perceptrons.1

3.1     The Processing Units (Neurons)

Much nonstandard terminology has been used in the neural net literature.
We will stick to the most general terms throughout our discussion. Where we
use two terms interchangeably, e.g., processing unit and neuron, we will include both
terms in the definition.

Definition 3.1 (Processing Unit) A processing unit (neuron) is the basic element of
an artificial neural network. A neuron consists of multiple input connections from
other neurons; a transfer function that maps the inputs to a scalar; an activation
function that maps the scalar to a real or binary activation (state); and an output
that broadcasts the activation to one or many other neurons.

Although biologically motivated, the processing units in neural networks only
remotely resemble biological neurons.

1Many people mistakenly think that an FNN is a multilayered perceptron. But they are not
equivalent. The differences will be clear following their definitions.

Figure 3.1. Structure of a single neuron

The first such processing unit was the McCulloch-Pitts neuron. This basic model
is still widely used today. It has multiple input ports and a single output port. Before
the inputs are fed into the neuron, they are multiplied by corresponding weights on
their pathways. The output is produced by taking the weighted sum of the inputs
and thresholding it via a Heaviside (threshold) function. A Heaviside function returns
one of two discrete values, $a$ and $b$, where $a, b \in \mathbb{R}$, $a < b$. Depending on whether the
input is greater than or less than the threshold $\theta$, $b$ or $a$ is returned. It is common
to set $a = 0$ and $b = 1$. A sketch of the model is shown in Figure 3.1.
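
The behavior of such a threshold unit can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, and returning the lower value when the weighted sum exactly equals the threshold is an assumption, since the text only covers the "greater than" and "less than" cases.

```python
def heaviside(net, theta=0.0, a=0, b=1):
    """Return b if net exceeds the threshold theta, otherwise a (with a < b)."""
    # Assumption: the boundary case net == theta is mapped to a.
    return b if net > theta else a


def mcculloch_pitts(inputs, weights, theta):
    """Weighted sum of the inputs followed by a Heaviside threshold."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return heaviside(net, theta)


def logical_and(x1, x2):
    """Example unit: weights (1, 1) and threshold 1.5 realize the AND function."""
    return mcculloch_pitts([x1, x2], [1.0, 1.0], theta=1.5)
```

With binary inputs, `logical_and` fires only when both inputs are 1, because only then does the weighted sum (2.0) exceed the threshold.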

Definition  3.2  (Net  Input)    The  net  input  results  from  mapping  multiple  inputs  to  a 
real  or  integer  value.  Frequently  this  takes  the  form  of  a  weighted  sum  of  the  inputs. 

Definition  3.3  (Activation  Function)    The  activation  function  is  a  function  that  maps 
the  net  input  to  a  real  or  binary  activation  value  (state)  of  the  processing  unit. 

Besides  the  heaviside  function,  other  commonly  used  activation  functions  include 
the  semilinear  function  and  the  sigmoid  function.  The  semilinear  function  is  a  non- 
decreasing  function,  linear  in  a  certain  range  and  constant  outside  that  range.  The 
sigmoid  function  is  a  differentiable,  monotonically  increasing  function.  The  introduc- 
tion of  continuous  activation  functions,  especially  the  sigmoid  function,  has  greatly 
enhanced  the  capability  and  trainability  of  the  multilayered  neural  networks.  The 
shapes  of  the  basic  activation  functions  are  shown  in  Figure  3.2. 
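
The two continuous activation functions can be sketched as follows. The concrete forms are assumptions made for illustration: the semilinear function is taken to rise linearly from 0 to 1 on a hypothetical interval [lo, hi], and the sigmoid is taken to be the logistic function, which is one common differentiable, monotonically increasing choice.

```python
import math


def semilinear(net, lo=-1.0, hi=1.0):
    """Nondecreasing: 0 below lo, 1 above hi, linear in between."""
    if net <= lo:
        return 0.0
    if net >= hi:
        return 1.0
    return (net - lo) / (hi - lo)


def sigmoid(net):
    """Logistic function 1 / (1 + exp(-net)), written to avoid overflow."""
    if net >= 0:
        return 1.0 / (1.0 + math.exp(-net))
    z = math.exp(net)
    return z / (1.0 + z)


def sigmoid_derivative(net):
    """Derivative of the logistic function: s * (1 - s)."""
    s = sigmoid(net)
    return s * (1.0 - s)
```

The derivative is available in closed form and is nonzero everywhere, which is what makes gradient-based training of multilayered networks workable; the Heaviside function offers no such gradient.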
3.2     Perceptron Learning

In  the  following,  we  give  definitions  concerning  perceptron  learning  and  then 
present  the  learning  algorithm  and  its  finite  convergence  theorem. 

Definition  3.4  (Learning  Rule)  A  learning  rule  (algorithm)  is  the  procedure  by  which 
an  artificial  neural  network  adjusts  the  internal  representation  (weights  and  thresh- 
olds) of  its  environment. 

Definition 3.5 (Perceptron) A perceptron is a simple neural network consisting of a
single processing unit or a set of processing units with Heaviside activation functions
and the perceptron learning algorithm.

Definition  3.6  (Training  Set)  A  training  set  T  is  a  sample  taken  from  a  given  popu- 
lation. This  sample  is  used  as  the  environment  of  the  neural  network  providing  inputs 
and  target  values  (if  applicable)  to  the  network. 

Definition 3.7 (Instance) Any particular element $x$ of the training set $T$ is an
instance. An instance $x$ may have binary or real-valued attributes.

Definition 3.8 (Sample Training) Sample (epoch) training refers to a neural net training
method whereby the network weights (including thresholds) are updated after the
presentation of all instances in the training set.

Definition  3.9  (Instance  Training)  Instance  (pattern)  training  refers  to  a  neural  net 
training method whereby the network weights are updated after the presentation of
each instance of the training sample. If the instance is chosen sequentially from the
sample,  it  is  called  sequential  instance  training  (sequential  training).  If  the  instance 
is  chosen  randomly  from  the  sample,  it  is  called  randomized  instance  training  (ran- 
domized training). 
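
The difference between Definitions 3.8 and 3.9 lies only in when the weights change, which the following Python sketch makes explicit. The update rule itself is an assumption chosen for brevity (gradient descent on the squared error of a single linear unit), as are the function names and the learning rate `eta`; randomized training is approximated here by shuffling the sample once per pass.

```python
import random


def gradient(w, x, t):
    # Gradient of 0.5 * (w.x - t)^2 with respect to w (assumed linear unit).
    err = sum(wi * xi for wi, xi in zip(w, x)) - t
    return [err * xi for xi in x]


def sample_pass(w, sample, eta):
    """Sample (epoch) training: accumulate over all instances, update once."""
    total = [0.0] * len(w)
    for x, t in sample:
        total = [a + g for a, g in zip(total, gradient(w, x, t))]
    return [wi - eta * gi for wi, gi in zip(w, total)]


def instance_pass(w, sample, eta, randomized=False, rng=None):
    """Instance (pattern) training: update after every single instance."""
    order = list(sample)
    if randomized:
        (rng or random).shuffle(order)  # randomized instance training
    for x, t in order:                  # otherwise sequential instance training
        w = [wi - eta * gi for wi, gi in zip(w, gradient(w, x, t))]
    return w
```

Starting from the same weights, one pass of each method generally ends at different weights, because in instance training every later instance is evaluated against weights that have already moved.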

Note that an instance, $x$, is an example of some concept (hypothesis) to be learned.
In the neural net training process, both the instances and the concepts associated
with the instances are provided to the network.2 Let $x \in \mathbb{R}^r$ be an instance, $T^+$
denote the set of positive instances (a positive instance is an example of the target
concept or class) and $T^-$ denote the set of negative instances (a negative instance
is a counterexample of the target concept or class). Let $w \in \mathbb{R}^r$ be a weight vector.
The perceptron learning algorithm can be stated as follows:


The Perceptron Learning Algorithm (PLA):

START:     Set $w \in \mathbb{R}^r$ randomly.
TEST:      Let $X = \{x \mid (x \in T^+ \text{ and } wx < 0) \text{ or } (x \in T^- \text{ and } wx > 0)\}$.
           If $X = \emptyset$, stop.
           Otherwise:
               pick any $x \in X$,
               if $x \in T^+$, go to ADD,
               if $x \in T^-$, go to SUBTRACT.
ADD:       $w \leftarrow w + x$,
           go to TEST
SUBTRACT:  $w \leftarrow w - x$,
           go to TEST
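
The START, TEST, ADD and SUBTRACT steps can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function names and the `max_iters` safeguard are hypothetical, the start vector defaults to zero instead of a random vector, the first misclassified instance is picked, and an instance whose inner product with $w$ is exactly zero is counted as misclassified so that a zero start vector still triggers updates.

```python
def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def perceptron_learning(positives, negatives, w=None, max_iters=10000):
    """Return (w, number_of_updates) once every instance is classified correctly."""
    dim = len((positives or negatives)[0])
    w = [0.0] * dim if w is None else list(w)            # START
    for updates in range(max_iters):
        # TEST: collect the instances on the wrong side of the hyperplane.
        bad_pos = [x for x in positives if dot(w, x) <= 0]
        bad_neg = [x for x in negatives if dot(w, x) >= 0]
        if not bad_pos and not bad_neg:
            return w, updates
        if bad_pos:                                      # ADD
            w = [wi + xi for wi, xi in zip(w, bad_pos[0])]
        else:                                            # SUBTRACT
            w = [wi - xi for wi, xi in zip(w, bad_neg[0])]
    raise RuntimeError("no separating vector found within max_iters updates")
```

For sets that are not linearly separable the loop would never terminate on its own, which is why the sketch stops after `max_iters` updates.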


To  facilitate  the  discussion  on  perceptron  convergence,  the  following  definitions  are 
needed: 


2This is called supervised learning, the case we are dealing with. Training with $x$ only is referred
to as unsupervised learning. The Kohonen neural network is an example of unsupervised learning.
3The product of two variables $w$ and $x$ written in the form of $wx$ is assumed to be their inner
product: $wx = \sum_{i=1}^{r} w_i x_i$. Sometimes we use $w \cdot x$ for clarity.

Definition 3.10 (Convex Set) A set $S \subset \mathbb{R}^r$ is convex if for each $x, y \in S$ and any
$\lambda \in [0, 1]$,

\[ z = \lambda x + (1 - \lambda) y \in S. \]

Definition 3.11 (Convex Hull) Let $S \subset \mathbb{R}^r$ be either finite or infinite. The convex hull
of $S$, denoted by $h(S)$, is the smallest convex set that contains $S$.

Definition 3.12 (Bounded Set) A set $S \subset \mathbb{R}^r$ is bounded if there exists $M \in \mathbb{R}$, $M > 0$,
such that

\[ S \subset B_0(M) = \{x \in \mathbb{R}^r \mid \|x\| < M\}. \]

Definition 3.13 (Linearly Separable) Let $S_1, S_2 \subset \mathbb{R}^r$ be either finite or infinite. $S_1$
and $S_2$ are linearly separable4 if there exists a nonzero vector $p \in \mathbb{R}^r$ and a scalar
$\alpha \in \mathbb{R}$ such that

\[ x \cdot p > \alpha \quad \forall x \in h(S_1) \]

and

\[ y \cdot p < \alpha \quad \forall y \in h(S_2). \]

Theorem 3.1 (Perceptron Convergence) Suppose $T^+$ and $T^-$ are bounded sets in $\mathbb{R}^r$
and are linearly separable. Then the perceptron learning algorithm will find a hyperplane
that separates $T^+$ and $T^-$ in finite time.

Proof:$^5$

Let $H = T^+ \cup -T^-$, then the PLA produces the sequence of vectors:

$$w^{n+1} = w^n + x^n, \qquad n = 0, 1, \ldots$$

where $w^0$ is arbitrary, and $x^n \in H$ is picked such that $w^n \cdot x^n \le 0$.

By assumption, there exists a $w^* \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$ such that $w^* \cdot x > \alpha$, for all $x \in H$.

Since $H$ is bounded, we can define

$$\beta = \sup_{x \in H} \|x\|^2.$$

4This definition is for strict separation (see any standard convex analysis book such as Convex
Analysis by Rockafellar, 1969). For simplicity, we refer to it as separable.

5This proof is an extended version of the proof by Minsky and Papert (1969).
At step $n$, we have

$$w^n \cdot w^* \le \|w^n\| \, \|w^*\|$$

by the Cauchy inequality, where

$$w^n \cdot w^* = (w^{n-1} + x^{n-1}) \cdot w^* = w^{n-1} \cdot w^* + x^{n-1} \cdot w^* > w^{n-1} \cdot w^* + \alpha \ge w^0 \cdot w^* + n\alpha$$

and, since $w^{n-1} \cdot x^{n-1} \le 0$ and $\|x\|^2 \le \beta$,

$$\|w^n\|^2 = \|w^{n-1} + x^{n-1}\|^2 = \|w^{n-1}\|^2 + 2 w^{n-1} \cdot x^{n-1} + \|x^{n-1}\|^2 \le \|w^{n-1}\|^2 + \beta \le \|w^0\|^2 + n\beta.$$

Thus we have

$$w^0 \cdot w^* + n\alpha < \sqrt{\|w^0\|^2 + n\beta} \; \|w^*\|,$$

or the quadratic inequality

$$\alpha^2 n^2 + (2\alpha \, w^0 \cdot w^* - \beta \|w^*\|^2) n < \|w^0\|^2 \|w^*\|^2 - (w^0 \cdot w^*)^2.$$  (3.1)

Since

$$k = (2\alpha \, w^0 \cdot w^* - \beta \|w^*\|^2)^2 + 4\alpha^2 (\|w^0\|^2 \|w^*\|^2 - (w^0 \cdot w^*)^2) \ge 0,$$

given any $\alpha$ and $\beta$, a solution to (3.1) exists and is finite. Thus, after at most

$$n_{max} = \frac{\beta \|w^*\|^2 - 2\alpha \, w^0 \cdot w^* + \sqrt{k}}{2\alpha^2}$$

iterations, the algorithm stops with a solution. This proves the theorem. $\Box$

If we assume that $w^0 = 0$, $\beta = 1$ and $\|w^*\| = 1$, then we have $k = 1$ and $n_{max} = 1/\alpha^2$,
which is the result obtained in Minsky and Papert (1969).
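
The bound is easy to evaluate numerically. The sketch below computes $k$ and $n_{max}$ from the quadratic inequality (3.1); the function name `max_iterations` and its argument names are assumptions introduced here for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def max_iterations(w0, w_star, alpha, beta):
    """Upper bound n_max on the number of PLA updates, from inequality (3.1)."""
    ww = dot(w_star, w_star)        # ||w*||^2
    w0w = dot(w0, w_star)           # w0 . w*
    k = (2 * alpha * w0w - beta * ww) ** 2 \
        + 4 * alpha ** 2 * (dot(w0, w0) * ww - w0w ** 2)
    return (beta * ww - 2 * alpha * w0w + math.sqrt(k)) / (2 * alpha ** 2)
```

With $w^0 = 0$, $\beta = 1$, $\|w^*\| = 1$ and $\alpha = 0.5$, the call `max_iterations([0.0, 0.0], [1.0, 0.0], 0.5, 1.0)` gives $1/\alpha^2 = 4$, in agreement with the special case above.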
Figure 3.3. Geometrical explanation of the perceptron learning

Note  that  the  proof  does  not  assume  finiteness  of  H.  The  PLA  procedure  can 
be  applied  to  infinite  sets,  as  long  as  provisions  are  made  to  carry  out  the  stopping 
criterion  test. 

To  understand  the  perceptron  convergence  procedure  geometrically,  the  following 
concepts  are  useful: 

Definition 3.14 (Convex Cone) Let $S \subset \mathbb{R}^n$ be a convex set. $S$ is a convex cone if
$\lambda x \in S$ for any $\lambda > 0$ and any $x \in S$.

Definition 3.15 (Dual Cone) Let $S \subset \mathbb{R}^n$, the dual cone of $S$, denoted by $S^*$, is

$$S^* = \{y \in \mathbb{R}^n \mid y \cdot x \ge 0 \text{ for every } x \in S\}.$$

Geometrically, the perceptron learning procedure finds an interior point in the dual
cone of $H = T^+ \cup -T^-$. Starting with any random vector $w^0$, the ADD procedure (by
the definition of $H$, the ADD procedure now includes the SUBTRACT procedure)
generates a sequence of $w^n$ such that $w^N \cdot x > 0$ for all $x \in H$, where $N$ is a finite
integer. This process is illustrated in Figure 3.3.

Viewing perceptron learning as finding a solution to a set of linear inequalities

$$w \cdot x > 0, \qquad \forall x \in H$$  (3.2)

opens a rich body of related research, using approaches known as relaxation methods
(see, for example, Agmon, 1954).

Various modifications have been suggested to the basic perceptron learning algorithm.
In step 3 (ADD weights) $w^{(n+1)} \leftarrow w^{(n)} + x^{(n)}$ can be replaced by

$$w^{(n+1)} \leftarrow w^{(n)} + k x^{(n)}$$

where $k > 0$ is a constant. $k = 1/\|x\|_2$ would make the weight change by a
unit vector in the direction of $x$. Agmon (1954) suggested (in a different context)
$k = c(w \cdot x)/\|x\|^2$ where $c \in (0, 2)$. The number of iterations of the algorithm changes
with these variations, but the finite convergence property is retained. The convergence
proof of perceptron variations, Adaline (Widrow and Hoff, 1960) and Madaline
(Widrow and Stearns, 1985) can be found in Poliac (1989).

The basic perceptron learning rule can be easily generalized to handle multiple
class problems. Let $H_1, H_2, \ldots, H_K$ be the sets of instances for each class. The
classification problem requires finding a $w^*$ such that for each $x_i \in H_i$,

$$w_i^* \cdot x_i > w_j^* \cdot x_i + \delta$$

for all $j = 1, 2, \ldots, K$, $j \ne i$, where $\delta > 0$ is a scalar. The learning procedure is
presented in the following. Proof of the convergence of this procedure is omitted
since it is a direct extension of Theorem 3.1.

Multi-class Perceptron Algorithm

START:   Set $w_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, K$ to any random values.

TEST:    Let $X_i = \{x_i \mid x_i \in H_i \text{ and } w_i \cdot x_i \le w_j \cdot x_i \text{ for some } j \ne i\}$.
         If $X_i = \emptyset$ for all $i = 1, 2, \ldots, K$, stop.
         Otherwise pick any $x_i \in X_i$, go to UPDATE.

UPDATE:  $w_i \leftarrow w_i + x_i$,
         $w_j \leftarrow w_j - x_i$,
         go to TEST.
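
The multi-class procedure can be sketched along the same lines. As before, this is only an illustration under stated assumptions: the function name, the iteration cap, the seeded random choice among violations and the initial range (-1, 1) are introduced here, and a tie between two class scores is treated as a violation.

```python
import random

def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def multiclass_perceptron(classes, max_iters=10_000, seed=0):
    """classes[i] lists the instances of H_i; returns one weight vector per class."""
    rng = random.Random(seed)
    K, n = len(classes), len(classes[0][0])
    w = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(K)]   # START
    for _ in range(max_iters):
        # TEST: every (i, j, x) where x is in H_i but class j scores at least as high.
        wrong = [(i, j, x)
                 for i, H in enumerate(classes) for x in H
                 for j in range(K)
                 if j != i and dot(w[i], x) <= dot(w[j], x)]
        if not wrong:
            return w
        i, j, x = rng.choice(wrong)
        w[i] = [a + b for a, b in zip(w[i], x)]    # UPDATE: w_i <- w_i + x_i
        w[j] = [a - b for a, b in zip(w[j], x)]    #         w_j <- w_j - x_i
    raise RuntimeError("no solution found within max_iters")
```

On return, each instance of class $i$ scores strictly highest under $w_i$, so classification reduces to taking the index of the largest score.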
Figure 3.4. The XOR problem and its geometrical representation.

One of the intriguing properties of the perceptron learning algorithm is that it uses
only locally available information, modifying weights after the presentation of each
input pattern. Yet the procedure constructs a globally optimal solution (for linearly
separable patterns). Local procedures are suitable for parallel implementation and
hence have the potential for fast, real-time applications. Minsky and Papert (1969)
pointed out that it would be interesting to compare the relative efficiency of the
perceptron procedure with global analytic methods, such as linear programming,
for solving the system of inequalities (3.2). No systematic study has been done
comparing perceptron learning with global analytic approaches. Many researchers
have, however, realized the importance of "locality" in learning (see, for example,
Jacobs, 1988). This issue is further explored in later chapters.

3.3     The Limitation of Perceptrons

Minsky and Papert (1969) showed that perceptrons failed to solve a number of
simple pattern classification problems, in particular, the Exclusive Or (XOR) problem.
For this historical reason, the XOR problem has been used extensively as a benchmark
for neural network algorithm evaluation. The problem has four patterns.
Each pattern has two binary inputs and one binary output. The output is true (with
value 1) if and only if exactly one of the inputs is true. Geometrically, the four patterns
correspond to the four vertices of a unit square. The two vertices on the same
diagonal belong to the same class (see Figure 3.4).
Figure 3.5. An example of layered perceptrons that solve the XOR problem

The  failure  of  the  perceptron  is  due  to  its  insufficient  knowledge  representation, 
not  its  learning  procedure.  Perceptrons  construct  only  linearly  separable  decision 
regions,  but  there  is  no  linearly  separable  region  that  can  solve  the  XOR  problem  as 
can  be  seen  in  Figure  3.4. 
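
The claim can also be checked algebraically. The short argument below is a sketch under the assumption, introduced here for illustration, that a single perceptron with weights $w_1$, $w_2$ and threshold $\theta$ outputs 1 exactly when $w_1 x_1 + w_2 x_2 > \theta$.

```latex
\begin{align*}
(0,0) \mapsto 0 &: \quad 0 \le \theta, \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 \le \theta, \\
(1,0) \mapsto 1 &: \quad w_1 > \theta, \\
(0,1) \mapsto 1 &: \quad w_2 > \theta.
\end{align*}
% Adding the last two inequalities and using \theta \ge 0 from the first:
\[
w_1 + w_2 > 2\theta \ge \theta ,
\]
% which contradicts the second inequality, so no such w_1, w_2, \theta exist.
```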

To solve the XOR problem, a more complex convex decision region is needed.
Multilayered perceptrons could form such a decision region. For example, let one
perceptron separate pattern (0,0) from the others, and another perceptron separate
pattern (1,1) from the others. A third perceptron, taking the output of the first two
as input, could produce a convex decision region that successfully classifies patterns
(0,1) and (1,0) into one group. The idea is depicted in Figure 3.5 (following Beals
and Jackson, 1990).
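
The construction can be made concrete with three threshold units. The weights and thresholds below are hand-picked for illustration and are an assumption of this sketch; they are not taken from Figure 3.5.

```python
def step(net):
    """Heaviside threshold unit: outputs 1 if the net input is positive, else 0."""
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    """Two first-layer perceptrons feeding a third one."""
    h1 = step(x1 + x2 - 0.5)     # separates (0,0) from the other three patterns
    h2 = step(1.5 - x1 - x2)     # separates (1,1) from the other three patterns
    return step(h1 + h2 - 1.5)   # fires only when both first-layer units fire
```

Only the patterns (0,1) and (1,0) lie on the positive side of both first-layer hyperplanes, so the third unit outputs 1 for exactly those two patterns.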

Thus  multilayered  perceptrons  are  powerful  enough  to  form  polyhedral  convex 
decision  regions.  This  solves  the  representation  problem  of  single  layer  perceptrons. 
Unfortunately, multilayered perceptrons created a barrier to learning. The perceptron
convergence learning procedure does not extend to multilayered perceptrons. The
problem is that perceptrons in the second layer are shielded away from the inputs by
the heaviside threshold function. The perceptron learning procedure can correctly
adjust only the weights between inputs and outputs, but not the weights between
perceptrons. This difficulty is overcome by introducing continuous activation functions
(Rumelhart et al., 1986). This is shown in the next section.

3.4     Feedforward Neural Nets and the BP Algorithm

Definition 3.16 (FNN) A feedforward neural network (FNN) is a neural network
consisting of neurons that are arranged in layers, namely, an input layer, hidden layer(s),
and an output layer. Connections are unidirectional from lower layers to higher layers
with no feedback paths.

By definition, multilayer perceptrons are a subset of feedforward neural nets with
heaviside activation functions. But, conventionally, when we say feedforward neural
nets we mean feedforward neural nets with continuous activation functions as
distinguished from perceptrons. Multilayer perceptrons are able to represent linearly
nonseparable problems, but there is no efficient learning procedure. Using FNN enables
us to solve the neural net "credit assignment" problem: given the output generated
from an input, which weights should be changed, and how, to approximate
the desired output? The classic algorithm to train an FNN is called backpropagation,
which is a learning algorithm that modifies the network weights based on their
contributions to a global performance criterion function. A gradient descent search
procedure is employed.

Let $(x, y)$ denote a training example (pattern), where $x$ is an input vector and
$y$ is the target output vector. Also, let $o$ denote the network output and $w$ denote
the weights of the network. We use $N_I \times N_H \times N_O$ to represent the structure of a
feedforward neural net where $N_I$, $N_H$ and $N_O$ are the number of input units, hidden
units and output units, respectively. Figure 3.6 shows a 2 x 2 x 2 fully connected
feedforward neural network. For convenience, only two processing units are used in
each layer. The indices for the input layer, hidden layer, and output layer are chosen
to be $i$, $j$, and $k$, respectively.

Figure 3.6. A 2 x 2 x 2 feedforward neural network

The input units simply pass on the input vector $x$. The units in the hidden layer
and output layer are processing units. The activation function is chosen to be the
sigmoid function

$$f(x) = \frac{1}{1 + e^{-\gamma x}}$$  (3.3)

where $\gamma$ is a constant controlling the slope of the function. The net input to a
processing unit $j$ is given by

$$net_j = \sum_i w_{ij} x_i + \theta_j$$  (3.4)

where the $x_i$'s are the outputs from the previous layer, $w_{ij}$ is the weight (connection
strength) of the link connecting unit $i$ to unit $j$, and $\theta_j$ the bias, which determines
the location of the sigmoid function on the x axis. For notational convenience, we
let $x_0 = 1$ and $w_{0j} = \theta_j$, then we have$^6$

$$net_j = \sum_i w_{ij} x_i.$$  (3.5)

The output of a processing unit is given by

$$o_j = f(net_j).$$  (3.6)

6From  now  on,  whenever  we  mention  weights,  thresholds  are  implicitly  included  unless  explicitly 
stated  otherwise. 
A feedforward neural net learns by being trained with known examples. A random
example $(x_p, y_p)$ is drawn from the training set $\{(x_p, y_p) \mid p = 1, 2, \ldots, P\}$, and $x_p$ is fed
into the network through the input layer. The network computes an output vector
$o_p$ based on the hidden layer output. $o_p$ is compared against the training target $y_p$.
A performance criterion function is defined based on the difference between $o_p$ and
$y_p$. A commonly used criterion function is the sum of squared error (SSE) function

$$F = \sum_p F_p = \frac{1}{2} \sum_p \sum_k (y_{pk} - o_{pk})^2$$  (3.7)

where $p$ is the index for the pattern (example) and $k$ the index for output units.

The error computed from the output layer is backpropagated through the network,
and weights ($w_{ij}$) are modified according to their contribution to the performance
criterion function:

$$\Delta w_{ij} = -\eta \frac{\partial F}{\partial w_{ij}}$$  (3.8)

where $\eta$ is called the learning rate, which determines the step size of the weight updating.

3.5      Backpropagation  Derivation 

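For ease of exposition, let us consider the error resulting from a single training
instance:

$$F_p = \frac{1}{2} \sum_k (y_{pk} - o_{pk})^2.$$  (3.9)

For connections leading to the output layer (refer to Figure 3.6), the partial derivative
of $F_p$ with respect to weight $w_{jk}$ can be written as

$$\frac{\partial F_p}{\partial w_{jk}} = \frac{\partial F_p}{\partial o_{pk}} \, \frac{\partial o_{pk}}{\partial net_k} \, \frac{\partial net_k}{\partial w_{jk}}$$  (3.10)

using the chain rule. Here

$$\frac{\partial F_p}{\partial o_{pk}} = -(y_{pk} - o_{pk})$$  (3.11)

$$\frac{\partial o_{pk}}{\partial net_k} = f'(net_k)$$  (3.12)

$$\frac{\partial net_k}{\partial w_{jk}} = o_j.$$  (3.13)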
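Denote

$$\delta_k = -\frac{\partial F_p}{\partial net_k} = (y_{pk} - o_{pk}) f'(net_k).$$  (3.14)

Then we have

$$\frac{\partial F_p}{\partial w_{jk}} = -\delta_k o_j$$  (3.15)

and

$$\Delta w_{jk} = -\eta \frac{\partial F_p}{\partial w_{jk}} = \eta \delta_k o_j.$$  (3.16)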

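This weight updating rule applies only to output layer weights (i.e., the weights
leading to the output layer). Similarly for hidden layer weights we have, by the chain
rule,

$$\frac{\partial F_p}{\partial w_{ij}} = \sum_k \frac{\partial F_p}{\partial net_k} \, \frac{\partial net_k}{\partial o_j} \, \frac{\partial o_j}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ij}}.$$  (3.17)

Since

$$\frac{\partial F_p}{\partial net_k} = -\delta_k$$  (3.18)

and

$$\frac{\partial net_k}{\partial o_j} = w_{jk},$$  (3.19)

define

$$\delta_j = -\frac{\partial F_p}{\partial net_j} = \sum_k \delta_k w_{jk} f'(net_j).$$  (3.20)

Then

$$\frac{\partial F_p}{\partial w_{ij}} = -\delta_j o_i$$  (3.21)

and

$$\Delta w_{ij} = -\eta \frac{\partial F_p}{\partial w_{ij}} = \eta \delta_j o_i.$$  (3.22)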
If the sigmoid activation function is used, we have

$$f'(net_j) = \frac{\gamma e^{-\gamma \, net_j}}{(1 + e^{-\gamma \, net_j})^2} = \gamma f(net_j)(1 - f(net_j)) = \gamma o_{pj}(1 - o_{pj}).$$  (3.23)

Thus  the  derivative  is  easily  obtained  from  the  output  of  the  processing  units.  Other 
performance  criterion  functions  may  be  defined  and  other  activation  functions  may 
be  used.  These  variations  will  be  covered  in  the  next  chapter. 
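
The identity in Equation 3.23 is easy to confirm numerically. The sketch below compares the derivative obtained from the unit's output with a central finite difference; the helper names and the step size `h` are assumptions introduced here.

```python
import math

def sigmoid(x, gamma=1.0):
    """Sigmoid activation f(x) = 1 / (1 + exp(-gamma * x))."""
    return 1.0 / (1.0 + math.exp(-gamma * x))

def sigmoid_prime_from_output(o, gamma=1.0):
    """Derivative expressed through the unit's output: gamma * o * (1 - o)."""
    return gamma * o * (1.0 - o)

def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

For instance, with $\gamma = 1$ and $net_j = 0$ the output is 0.5 and the derivative is $0.5 \times 0.5 = 0.25$, the largest value the slope can take for that $\gamma$.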

The backpropagation algorithm is formally stated below:

Algorithm BP

1. INITIALIZE:

   • Construct the feedforward neural network. Choose the number of input
     units and the number of output units equal to the length of input vector
     $x$ and the length of target vector $y$, respectively.

   • Randomize the weights and bias in the range (-.5, .5).

   • Specify a stopping criterion such as $F < F_{stop}$ or $n > n_{max}$. Set iteration
     number $n = 0$.

2. FEEDFORWARD:

   • Compute the output for the noninput units. The network output for a
     given example $p$ is

     $$o_{pk} = f\Big(\sum_j w_{jk} f\Big(\sum_m w_{mj} f\Big(\cdots f\Big(\sum_i w_{il} x_i\Big)\Big)\Big)\Big).$$

   • Compute the error using Equation 3.7.

   • If a stopping criterion is met, stop.

3. BACKPROPAGATE:

   • $n \leftarrow n + 1$.

   • For each output unit $k$, compute

     $$\delta_k = (y_k - o_k) f'(net_k).$$

   • For each hidden unit $j$, compute

     $$\delta_j = f'(net_j) \sum_k \delta_k w_{jk}.$$

4. UPDATE:

   $$\Delta w_{ij}(n + 1) = \eta \delta_j o_i + \alpha \Delta w_{ij}(n)$$

   where $\eta > 0$ is the learning rate (step size) and $\alpha \in [0, 1)$ is a constant called
   the momentum.

5. REPEAT:
   Go to Step 2.
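
A compact Python sketch of Algorithm BP for a network with one hidden layer follows. It is an illustration under several assumptions introduced here: $\gamma = 1$, the bias is carried by a constant input $x_0 = 1$, one randomly drawn pattern is used per update, and the function names and default parameter values are hypothetical.

```python
import math
import random

def sigmoid(x):
    """Sigmoid with gamma = 1, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def forward(w_hid, w_out, x):
    """Feedforward pass; index 0 of xb and hid is the constant bias input 1."""
    xb = [1.0] + list(x)
    hid = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    out = [sigmoid(sum(w * v for w, v in zip(row, hid))) for row in w_out]
    return xb, hid, out

def pattern_error(w_hid, w_out, x, y):
    """F_p = 1/2 * sum_k (y_k - o_k)^2 for a single pattern."""
    out = forward(w_hid, w_out, x)[2]
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, out))

def deltas(w_hid, w_out, x, y):
    """Return (xb, hid, delta_hid, delta_out), using f'(net) = o * (1 - o)."""
    xb, hid, out = forward(w_hid, w_out, x)
    d_out = [(yk - ok) * ok * (1.0 - ok) for yk, ok in zip(y, out)]
    d_hid = [hid[j + 1] * (1.0 - hid[j + 1])
             * sum(d_out[k] * w_out[k][j + 1] for k in range(len(w_out)))
             for j in range(len(w_hid))]
    return xb, hid, d_hid, d_out

def train(patterns, n_hidden, eta=0.5, alpha=0.9, f_stop=0.01, n_max=5000, seed=0):
    """Per-pattern backpropagation with momentum; returns (w_hid, w_out)."""
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # w_hid[j][i]: input i -> hidden j;  w_out[k][j]: hidden j -> output k.
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    dw_hid = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_out = [[0.0] * (n_hidden + 1) for _ in range(n_out)]
    for _ in range(n_max):
        if sum(pattern_error(w_hid, w_out, x, y) for x, y in patterns) < f_stop:
            break                                            # stopping criterion
        x, y = rng.choice(patterns)                          # draw a random example
        xb, hid, d_hid, d_out = deltas(w_hid, w_out, x, y)   # deltas use the old weights
        for k in range(n_out):
            for j in range(n_hidden + 1):
                dw_out[k][j] = eta * d_out[k] * hid[j] + alpha * dw_out[k][j]
                w_out[k][j] += dw_out[k][j]
        for j in range(n_hidden):
            for i in range(n_in + 1):
                dw_hid[j][i] = eta * d_hid[j] * xb[i] + alpha * dw_hid[j][i]
                w_hid[j][i] += dw_hid[j][i]
    return w_hid, w_out
```

A call such as `train(xor_patterns, n_hidden=2)`, where `xor_patterns` is a list of `(input, target)` pairs, runs the loop until the error criterion or the iteration limit is reached. Whether a particular run reaches the error criterion depends on the initial weights and the parameter settings.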

3.6     The  Representation  Capability  of  FNN 

A feedforward neural net can be regarded as a general nonlinear model. In effect,
it is a complex function consisting of a convoluted set of transfer functions and
activation functions $f \in C$, where $C$ is a set of continuously differentiable functions,
and the parameter set $W$ called weights (including thresholds). The output of a
feedforward neural net can be written as:

$$o = f\Big(\sum_j w_{jk} f_j\Big(\sum_m w_{mj} f_m\Big(\cdots f_l\Big(\sum_i w_{il} x_i\Big)\Big)\Big)\Big).$$  (3.24)

The next result shows that a two-layer FNN can approximate a large class of functions.

Theorem 3.2 For any absolutely integrable function $g : \mathbb{R}^n \to \mathbb{R}$, there exists a two
layer FNN$^7$ with absolutely integrable activation functions that approximates $g$ to any
arbitrary accuracy.


7By convention, only the hidden layer and the output layer are counted. Thus a two layer FNN
has one hidden layer and one output layer.
This theorem is a direct result of Poliac's (1989) Theorem 4.8.1. The requirement
of $f$ to be absolutely integrable is relaxed by Hornik, Stinchcombe and White (1989),
Cybenko (1989) and others to include the use of sigmoid activation functions. Hornik
(1991) further proved that FNNs with as few as a single hidden layer and arbitrary
bounded and nonconstant activation functions are universal approximators to any
continuous function based on an $L^p$ norm performance criterion.

The  above  results  assume  that  the  number  of  processing  units  in  the  hidden  layer 
is  unlimited.  A  theorem  by  Kolmogorov  (1957)  can  be  applied  to  FNN  to  yield  a 
three  layer  neural  network  that,  with  finite  hidden  layer  units,  can  exactly  represent 
any  continuous  function.8 

Theorem 3.3 (Kolmogorov) There exist fixed increasing continuous functions $h_{ij}$ on
$I = [0, 1]$ such that each continuous function $g$ on $I^n = [0, 1]^n$ can be written in the
form

$$g(x_1, \ldots, x_n) = \sum_{j=1}^{2n+1} f_j\Big(\sum_{i=1}^{n} h_{ij}(x_i)\Big)$$

where $f_j$ are properly chosen continuous functions of one variable.

The theorem suggests that any continuous function of many variables can be
represented as the linear superposition of some continuous univariate functions. In terms
of neural nets, this can be interpreted as follows. For any continuous function $g$ of $n$
variables, there exists a feedforward neural network with two hidden layers (each
processing unit in the hidden layers has a continuous activation function) that exactly
represents $g$. A two-input network structure corresponding to Kolmogorov's theorem
is shown in Figure 3.7.

Several variations of Kolmogorov's theorem exist (Lorentz, 1976). In particular,
each function $f_j$ can be chosen identically and function $h_{ij}$ can be replaced by $l_i h_j$,
where $l_i$ is constant and $h_j(x)$ is continuous and nondecreasing (cf. Poggio and Girosi,
1989). Thus $g(x) \in C(I^n)$ can be written as

$$g(x_1, \ldots, x_n) = \sum_{j=1}^{2n+1} f\Big(\sum_{i=1}^{n} l_i h_j(x_i)\Big).$$  (3.25)


8Hecht-Nielsen,  1987,  is  credited  for  applying  Kolmogorov's  theorem  to  FNN  for  the  first  time. 
Figure 3.7. An example of the Kolmogorov neural network

Correspondingly, we have the following theorem.

Theorem 3.4 Given any continuous function $g : \mathbb{R}^n \to \mathbb{R}$, there exists a three-layer
feedforward neural network that exactly represents $g$, with n(n + 1) processing units in
the first hidden layer and 2n + 1 processing units in the second hidden layer.


Kolmogorov's theorem shows that FNN has powerful representation capability.
However, this theorem is nonconstructive. That is, we know that there exist such
functions $h_j$ and $f$, but we have no clue as to how to construct them. Hence the
application of Kolmogorov's theorem in neural nets has been limited to theory.

As an illustration of FNN's capability, we can construct simple neural nets with
one or two hidden units that solve the XOR problem, using the standard backpropagation
algorithm. The solutions were obtained in about 250 iterations. Figure 3.8
shows the network structures and the weights and thresholds resulting from
backpropagation training. Figure 3.9 is the output function surface of the trained neural
network (2 x 1 x 1), where the coordinates range from zero to one. It clearly shows
that the points (0,0) and (1,1) are grouped together to form one class (with low
values) while the other two points make the other class.

Figure 3.8. Two simple neural nets that solve the XOR problem

Figure 3.9. Output function surface of the 2 x 1 x 1 network


CHAPTER  4 
VARIATIONS  OF  BACKPROPAGATION  LEARNING 

The backpropagation algorithm, due to its simplicity and general applicability,
has quickly become the dominant training algorithm for feedforward neural networks.
Although successful applications of the BP algorithm are numerous, neural network
researchers soon found that the algorithm has some fundamental limitations. First of
all, BP training may fail to converge. Secondly, BP may reach only a local minimum
solution when it does converge, as in any gradient descent based algorithm. The
local minimum may or may not represent an acceptable solution. Furthermore, BP
training is generally very slow as compared to non-neural net approaches. This has
prevented the use of feedforward neural nets in real-time applications.

An enormous amount of work has been done to improve BP learning in the last
few years. In the following we present new developments in this area concerning
convergence, generalization and learning rate, while leaving the discussion on global
optimal solutions to Chapters 5-7. We consider BP variations in criterion function,
activation functions, network structure, second order training algorithms and some
heuristics.

4.1      Performance  Criterion  Function 

We have used total sum of squared (TSS) error as the performance criterion in
our discussion in Chapter 3. TSS is the standard and most widely used performance
criterion. Besides its conceptual and implementational simplicity, it has the advantage
that under the assumption that training samples are independently chosen from
a Gaussian distribution, the least squared error (minimizing TSS) estimation is
statistically equivalent to the maximum likelihood estimation (MLE), which has many
desirable statistical properties (Wang and Malakooti, 1989).

Under certain circumstances, TSS may not be the best performance criterion.
For instance, in a classification problem, the number of misclassifications may be
more appropriate than the TSS criterion. Burrascano and Lucci (1990) compared
the least square error ($L_2$ norm) and the min-max ($L_\infty$ norm) performance criteria.
The former is better if the data follow a Gaussian distribution, while the latter should
be used if the data distribution is nearly uniform. The min-max criterion function
is non-differentiable. To carry out gradient descent search, a pseudo derivative is
defined as

$$\frac{\partial F_p}{\partial o_{pk}} = \frac{\partial}{\partial o_{pk}} |y_{pk^*} - o_{pk^*}| =
\begin{cases}
0 & \text{if } k \ne k^* \\
-1 & \text{if } k = k^* \text{ and } y_{pk} < o_{pk} \\
+1 & \text{if } k = k^* \text{ and } y_{pk} > o_{pk}
\end{cases}$$  (4.1)

where $k^* = \arg\max_k |y_{pk} - o_{pk}|$. Correspondingly, we have

$$\delta_k = -\frac{\partial F_p}{\partial net_k} =
\begin{cases}
0 & \text{if } k \ne k^* \\
+o_{pk}(1 - o_{pk}) & \text{if } k = k^* \text{ and } y_{pk} < o_{pk} \\
-o_{pk}(1 - o_{pk}) & \text{if } k = k^* \text{ and } y_{pk} > o_{pk}.
\end{cases}$$  (4.2)

This is used in the updating rule for the output layer. The $\delta_j$'s in the hidden
layer(s) are not changed. With the above modification, the standard backpropagation
algorithm (Section 3.5) can be employed. Burrascano and Lucci (1990) reported that
better performance was achieved with the min-max criterion for the parity problem.$^1$

For classification problems, Hampshire and Waibel (1990) proposed the "classification
figure-of-merit" (CFM) criterion function, which is defined as


CFM=     £ 


(4.3) 


k=l,kjtt  X  ^  t 

where $o_t$ is the output from the "true" (correct classification) unit and $o_k$ is the
output from a non-true unit. We observe that CFM is comprised of the sum of sigmoid
functions. No target values appear in the criterion function. CFM has the following
characteristics:


1The problem has an n-bit binary input. The output is one if the input string has an odd number
of ones; otherwise, the output is zero. The XOR problem discussed in Chapter 3 is a special case of
the parity problem with two-bit inputs.


•  It requires the output unit representing the correct classification to have a
higher activation value than any other output units.

•  It discourages the network from learning specific examples, and encourages
learning a general representation of the training data.

•  It alleviates the problem of the TSS criterion where outliers tend to mislead
the learning process.

Hampshire and Waibel reported that slightly better results were obtained using the CFM
criterion than with the sum of squared errors. Assisted by an ad-hoc post-processing
procedure, the results from the CFM criterion became significantly better than those
obtained with the TSS criterion.

Standard BP uses a sigmoid function as the non-linear activation function (see
Figure 3.2). The sigmoid function has an automatic gain control property. That is,
when the activation value is close to saturation (-1 or 1)$^2$, the output change
corresponding to an input change is small; when the activation value is far from saturation,
the output change corresponding to an input change is large. This property is
important to the stability of a dynamic network. However, the sigmoid nonlinearity
hinders the learning process with its near-zero derivative over a large range of input
values. This is easily seen from the BP learning rule (for the output layer):

$$\Delta w_{jk} = -\eta \frac{\partial F_p}{\partial w_{jk}} = \eta (y_{pk} - o_{pk}) f'(net_k) o_j.$$  (4.4)

When $(y_k - o_k) \to 0$, we do not need to change the weights as the target values are
learned. When $o_j \to 0$, there is no need to adjust the corresponding weight $w_{jk}$, since
$w_{jk}$ has no effect on the net input. But the case $f' \to 0$ does not tell us much. Since
$f'(net_k) = o_k(1 - o_k)$, $f' \to 0$ whether $o_k$ approaches the target value (0 or 1) or $o_k$
approaches the opposite of the target value. In the latter case, learning stops with
a large error. This fact increases the probability that the neural nets get stuck in a
local minimum.

2Activation functions with a range in [-1, +1] are sometimes referred to as polary functions, as
compared to binary functions with range in [0, 1]. Polary activation functions have the advantage
that a non-zero output (as input to the next layer) may speed up learning. We use binary activation
throughout the dissertation unless explicitly stated otherwise.

Burrascano  and  Lucci  (1990)  proposed  a  delta  rule  of  the  form 

k  ~  l  _|_  eynetk  (4-5) 

which, contrary to the standard delta rule, has larger values when the activation
approaches ±1. Their experiments showed that with the new delta rule, the modified
BP algorithm performed slightly better than the classic BP algorithm. More
importantly, the modified version had a much smaller failure fraction than the
normal BP algorithm. The authors claimed that the proposed modification virtually
eliminates non-convergence problems if a moderate learning rate is applied.

Another alternative to the sum-of-square error criterion is the cross-entropy
performance function defined as:

$$F = -\sum_p \sum_k \big(y_{pk} \log(o_{pk}) + (1 - y_{pk}) \log(1 - o_{pk})\big).$$  (4.6)

The derivative of $F$ with respect to $o_{pk}$ is

$$\frac{\partial F}{\partial o_{pk}} = -\frac{y_{pk}}{o_{pk}} + \frac{1 - y_{pk}}{1 - o_{pk}}.$$  (4.7)

Note that $\partial F / \partial o_{pk} \to \infty$ as $o_{pk} \to 1$ and $\partial F / \partial o_{pk} \to -\infty$ as $o_{pk} \to 0$. This brings a
counteracting effect to the problem mentioned above, i.e., learning is hindered when
the output approaches saturation. Indeed, experiments by Fahlman (1988) showed
that using the cross entropy criterion, the learning speed of a neural network on
the encoding problem increased by 50% as compared to using the standard
sum-of-squared-error criterion.
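
The counteracting effect can be made explicit by combining Equation 4.7 with the sigmoid derivative of Equation 3.23. The short derivation below is a worked sketch for a single pattern and a sigmoid output unit; it uses the definition $\delta_k = -\partial F / \partial net_k$ from Chapter 3.

```latex
\begin{align*}
\delta_k = -\frac{\partial F}{\partial net_k}
  &= -\frac{\partial F}{\partial o_{pk}} \, f'(net_k) \\
  &= \left( \frac{y_{pk}}{o_{pk}} - \frac{1 - y_{pk}}{1 - o_{pk}} \right)
     \gamma \, o_{pk} (1 - o_{pk}) \\
  &= \gamma \left[ y_{pk} (1 - o_{pk}) - (1 - y_{pk}) \, o_{pk} \right] \\
  &= \gamma \, (y_{pk} - o_{pk}).
\end{align*}
```

The factor $o_{pk}(1 - o_{pk})$ cancels, so the output layer delta stays proportional to the error even when the output unit is saturated at the wrong value.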

4.2     Momentum 

A simple variation of the classic backpropagation algorithm is to add a "momentum"
term to the weight updating formula

$$\Delta w_{ij}(t + 1) = -\eta \frac{\partial F}{\partial w_{ij}} + \alpha \Delta w_{ij}(t)$$  (4.8)

where $\alpha \in [0, 1)$. This modification augments successive gradients by adding a fixed
portion of previous weight changes. The effect of the momentum term is to accelerate
the weight changes when successive gradients have the same signs and to slow down
weight changes when successive gradients have different signs. Thus, it helps to speed
up the search in the weight space where the down-hill gradient is small, and to damp
oscillations that are likely to occur in the ravine areas if only a fixed learning rate
is used. Reports (e.g., Chauvin, 1990) have shown that the momentum term can
speed up the learning process significantly. Since the use of momentum was
proposed by Rumelhart et al. (1986), the authors who popularized the backpropagation
algorithm, and it is used almost always in backpropagation learning, we will refer to the
backpropagation algorithm with the use of momentum as the standard or classic BP
algorithm in our later discussion.
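
The accelerating and damping effects can be seen by applying Equation 4.8 to short gradient sequences. The sketch below tracks a single weight; the function names and the values $\eta = 0.1$ and $\alpha = 0.9$ are assumptions chosen for illustration.

```python
def momentum_step(grad, prev_delta, eta=0.1, alpha=0.9):
    """Weight change of Equation 4.8: -eta * gradient + alpha * previous change."""
    return -eta * grad + alpha * prev_delta

def momentum_deltas(grads, eta=0.1, alpha=0.9):
    """Sequence of weight changes produced by a sequence of gradients."""
    deltas, prev = [], 0.0
    for g in grads:
        prev = momentum_step(g, prev, eta, alpha)
        deltas.append(prev)
    return deltas
```

With three gradients of the same sign, `momentum_deltas([1.0, 1.0, 1.0])`, the changes grow in magnitude: -0.1, -0.19, -0.271. With alternating signs, `momentum_deltas([1.0, -1.0, 1.0])`, the second and third changes shrink to 0.01 and -0.091, so the oscillation is damped.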

Adding the momentum term is analogous to signal smoothing. This observation
led Adams (1991) to propose using both past and future information in momentum,
analogous to a symmetric smoothing. The idea is simple: in the standard BP
algorithm, when the hidden layer weights are updated, we already have the information
to compute the weight change in the next iteration, since

$$\delta_j(t + 1) = o_j(1 - o_j) \sum_k \delta_k(t + 1) w_{jk}(t + 1)$$  (4.9)

and

$$\Delta w_{ij}(t + 1) = \eta \delta_j(t + 1) o_i$$  (4.10)

where the future $\delta_j(t + 1)$ is obtained through the newly computed output layer
weights. Hence the hidden layer weight updating can be modified as

$$\Delta w_{ij}(t) = \eta \delta_j(t) o_i + \alpha_1 \Delta w_{ij}(t - 1) + \alpha_2 \Delta w_{ij}(t + 1)$$  (4.11)

where $\alpha_1$ and $\alpha_2$ are the coefficients corresponding to the past and future momentums.
The improvement of learning speed obtained by the author was moderate with this
modification.

4.3     Second  Order  Methods 

Since backpropagation uses a gradient descent based search, the rich literature in
optimization naturally becomes a ready source of new algorithms for neural network
training. Most optimization methods are based on the same strategy, that is, using
some iterative process in which an approximation of the criterion function is minimized.
Commonly used approximations are given by the first order or second order
Taylor-series expansion, i.e.,

$$F(w + \Delta w) = F(w) + \Delta w^T \nabla F + \cdots \tag{4.12}$$

or

$$F(w + \Delta w) = F(w) + \Delta w^T \nabla F + \frac{1}{2} \Delta w^T \nabla^2 F(w)\, \Delta w + \cdots \tag{4.13}$$

where $\nabla F$ denotes the gradient of $F$, and $\nabla^2 F$ denotes the Hessian of $F$. Classic
backpropagation is an example of using a first order approximation.³ First order and
second order approximations are also referred to as linear and quadratic approximations,
respectively.

First order approximations use only local gradient and function values, while
second order approximations also use curvature information. Hence second order
methods usually converge faster. Among the most successful applications of
second order methods in neural networks are the conjugate gradient (CG) algorithms
and Newton's methods.

Let us consider a general iterative process. Suppose we want to minimize a criterion
function $F(w)$. We determine a search direction $d_t$ and a stepsize $\lambda_t$. The iterate is

$$w^{t+1} = w^t + \lambda_t d_t \tag{4.14}$$

where $d_t$ and $\lambda_t$ are determined such that $F(w^{t+1}) < F(w^t)$ or $F(w^{t+1})$ is minimized.
Most optimization algorithms fall into this framework. They differ in the way $d_t$ and
$\lambda_t$ are computed. If $d_t$ is set to be the negative gradient $-\nabla F(w^t)$, and $\lambda_t$ to be a
constant $\eta$, then we have the simple gradient descent algorithm discussed in Chapter 3.


³ We say the approximation is first order if the first order Taylor-series expansion is used.
Similarly, the approximation is second order if the second order Taylor-series expansion is used.
4.3.1     Conjugate  Gradient  Methods

Let $F_a(z)$ denote the second order approximation to $F(w)$ in the neighborhood
of $w$:

$$F_a(z) = F(w) + \nabla F(w)^T z + \frac{1}{2} z^T \nabla^2 F(w)\, z. \tag{4.15}$$

The necessary condition for $F_a$ to be minimized is

$$\nabla F_a(z) = \nabla F(w) + \nabla^2 F(w)\, z = 0. \tag{4.16}$$

At the current solution $w^t$, Equation 4.16 represents a system of linear equations with
variable $z$ (an $S \times 1$ vector). The solution to this system of equations can be greatly
simplified if a set of vectors, called a conjugate system, can be found.

Definition 4.1 (Conjugate System) Let $d_1, d_2, \ldots, d_k$ be a set of non-zero vectors in
$R^p$, and $A$ be a $p \times p$ nonsingular matrix. Then $d_1, d_2, \ldots, d_k$ is a conjugate system
with respect to $A$ if $d_1, d_2, \ldots, d_k$ are linearly independent and

$$d_i^T A\, d_j = 0 \qquad \forall\, i \neq j, \quad i, j = 1, 2, \ldots, k.$$

Suppose we have a conjugate system $d_1, d_2, \ldots, d_S \in R^S$ with respect to $\nabla^2 F(w)$.
Let $z^*$ be a solution to Equation 4.16 and $z^0 \in R^S$ be an arbitrary initial point. Since
$d_1, d_2, \ldots, d_S$ form a basis of $R^S$, any vector in $R^S$ can be expressed as a linear
combination of the conjugate vectors. In particular,

$$z^* - z^0 = \sum_{i=1}^{S} \lambda_i d_i \tag{4.17}$$

where $\lambda_i \in R$. Multiplying both sides by $d_j^T \nabla^2 F(w)$ gives

$$d_j^T \nabla^2 F(w)(z^* - z^0) = \sum_{i=1}^{S} \lambda_i\, d_j^T \nabla^2 F(w)\, d_i. \tag{4.18}$$

By the definition of $d_i$, $i = 1, 2, \ldots, S$, all terms with $i \neq j$ vanish. Substituting
$-\nabla F(w)$ for $\nabla^2 F(w) z^*$ (since $z^*$ is a solution to Equation 4.16), we have

$$d_j^T\left(-\nabla F(w) - \nabla^2 F(w)\, z^0\right) = \lambda_j\, d_j^T \nabla^2 F(w)\, d_j. \tag{4.19}$$

Solving for $\lambda_j$ gives

$$\lambda_j = \frac{d_j^T\left(-\nabla F(w) - \nabla^2 F(w)\, z^0\right)}{d_j^T \nabla^2 F(w)\, d_j}
= -\frac{d_j^T \nabla F_a(z^0)}{d_j^T \nabla^2 F(w)\, d_j}. \tag{4.20}$$

If we find the conjugate system in $S$ steps, then we can determine $z^*$ in $S$ steps using
the above equations.

The conjugate vectors $d_i$, $i = 1, 2, \ldots, S$ can be determined recursively. $d_1$ can be
set equal to the negative gradient $-\nabla F_a(z^0)$, and $d_t$ can be determined as a linear
combination of the current negative gradient $-\nabla F_a(z^t)$ and the previous direction
$d_{t-1}$ (Moller, 1990). Detailed treatment of the conjugate gradient algorithm can be
found in Johansson et al. (1990).

Note  that  the  iterative  process  converges  in  S  steps  if  F(w)  is  a  quadratic  function. 
Fa(z)  then  becomes  an  exact  representation  of  F(w).  In  practice  the  conjugate 
gradient  algorithm  takes  more  than  S  steps  to  converge  since  F(w)  is  usually  not 
quadratic. 
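The finite-termination property can be checked on a small example. The sketch below is a textbook linear conjugate gradient iteration for the quadratic $\frac{1}{2} z^T A z - b^T z$, written in plain Python; the matrix `A`, the vector `b`, and all function names are assumptions chosen for illustration, and the direction update uses the common Fletcher-Reeves form rather than any particular variant from the papers cited above.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, z0):
    """Minimize 0.5 z^T A z - b^T z for a symmetric positive definite A."""
    z = list(z0)
    r = [bi - azi for bi, azi in zip(b, matvec(A, z))]  # negative gradient at z
    d = list(r)                                         # first direction
    directions = []
    for _ in range(len(b)):                             # at most S steps
        Ad = matvec(A, d)
        lam = dot(r, d) / dot(d, Ad)                    # step along d (cf. Eq. 4.20)
        z = [zi + lam * di for zi, di in zip(z, d)]
        directions.append(d)
        r_new = [ri - lam * adi for ri, adi in zip(r, Ad)]
        if dot(r_new, r_new) < 1e-30:                   # gradient vanished: done
            break
        beta = dot(r_new, r_new) / dot(r, r)
        d = [rn + beta * di for rn, di in zip(r_new, d)]
        r = r_new
    return z, directions


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
z_star, dirs = conjugate_gradient(A, b, [0.0, 0.0, 0.0])
residual = [bi - azi for bi, azi in zip(b, matvec(A, z_star))]
```

For this three-dimensional quadratic the loop stops after at most three directions, and the directions it generates are mutually conjugate with respect to `A`.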

Computing and storing the Hessian matrix $\nabla^2 F(w)$ is expensive or infeasible
for large problems. (They require $O(S^3)$ and $O(S^2)$ operations, respectively.) In
implementing the CG algorithm, the following estimation is often used:

$$\nabla^2 F(w^t)\, d_t \approx \frac{\nabla F(w^t + \sigma_t d_t) - \nabla F(w^t)}{\sigma_t} \tag{4.21}$$

for some small $\sigma_t \in R$, $\sigma_t > 0$.
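A short sketch of this finite-difference estimate of the Hessian-vector product is given below. The test function $F(w) = w_0^2 w_1 + w_1^3$, its hand-coded gradient, and the function names are hypothetical choices made only to have something to differentiate.

```python
def grad_F(w):
    """Gradient of the assumed test function F(w) = w0**2 * w1 + w1**3."""
    return [2.0 * w[0] * w[1], w[0] ** 2 + 3.0 * w[1] ** 2]


def hessian_vector_product(grad_fn, w, d, sigma=1e-6):
    """Approximate (Hessian of F at w) * d using two gradient evaluations."""
    shifted = [wi + sigma * di for wi, di in zip(w, d)]
    g_shifted, g_here = grad_fn(shifted), grad_fn(w)
    return [(a - b) / sigma for a, b in zip(g_shifted, g_here)]


w = [1.0, 2.0]
d = [0.5, -1.0]
approx = hessian_vector_product(grad_F, w, d)

# Exact Hessian at w is [[2*w1, 2*w0], [2*w0, 6*w1]] = [[4, 2], [2, 12]].
exact = [4.0 * d[0] + 2.0 * d[1], 2.0 * d[0] + 12.0 * d[1]]
```

Only the product with the current direction is formed, so the full $S \times S$ matrix is never computed or stored.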

Conjugate gradient methods are generally regarded as among the most efficient
methods for large-scale optimization problems. Johansson et al. (1990) reported
that their implementation of the CG algorithm outperformed standard BP by an order
of magnitude in terms of training speed. Moller (1990) improved the CG algorithm
by introducing a scaling factor that alleviates the problems caused by an indefinite
Hessian matrix. His SCG (scaled conjugate gradient) algorithm performed better
than the standard CG and appeared to scale up to large problems better than
standard BP.
4.3.2     Newtonian  Algorithms

Assuming that $F$ is twice continuously differentiable, Newton's method finds a fixed
point through the following iterate:

$$w^{t+1} = w^t - \alpha \left(\nabla^2 F(w^t)\right)^{-1} \nabla F(w^t). \tag{4.22}$$

Note that if $F$ is quadratic, then Newton's method converges to the minimum
in a single step. This is seen by letting $F(w) = \frac{1}{2} w^T A w - b^T w$, where $A$ is positive
definite, and $\alpha = 1$; then we have $w^{t+1} = w^t - A^{-1}(A w^t - b) = A^{-1} b$. Even if $F$
is not quadratic, under reasonable assumptions, Newton's method is guaranteed to
converge to a local minimum from an arbitrary initial point (Schneider et al., 1991).
It also converges fast when it reaches the neighborhood of a solution. However,
Newton's method is rarely used in its unmodified form because of the cost associated
with computing the Hessian matrix and its inverse. Also, the method works well only
when it has a good initial solution (Becker and le Cun, 1988).
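The single-step property on a quadratic can be verified numerically. The following sketch uses an assumed $2 \times 2$ positive definite matrix `A` and vector `b`, and solves the linear system directly instead of forming the inverse; the helper names are illustrative.

```python
def solve_2x2(M, rhs):
    """Solve M x = rhs for a 2x2 matrix M by Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det]


def newton_step(w, grad, hess, alpha=1.0):
    """One Newton iterate: w - alpha * hess^{-1} grad."""
    step = solve_2x2(hess, grad)
    return [wi - alpha * si for wi, si in zip(w, step)]


# Assumed quadratic F(w) = 0.5 w^T A w - b^T w with gradient A w - b.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
w0 = [10.0, -7.0]
grad0 = [A[0][0] * w0[0] + A[0][1] * w0[1] - b[0],
         A[1][0] * w0[0] + A[1][1] * w0[1] - b[1]]

w1 = newton_step(w0, grad0, A)   # lands on the minimizer A^{-1} b = [0.2, 0.4]
```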

A class of modified Newton's methods, called quasi-Newton methods, computes the
search direction via

$$d_t = -H^{-1} \nabla F(w^t) \tag{4.23}$$

where $H$ is an approximation to the Hessian matrix $\nabla^2 F(w)$. The most successful
quasi-Newton algorithm is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.
In the BFGS method $H^{-1}$ is obtained iteratively by

$$H^{-1} \leftarrow V^T H^{-1} V + \frac{s s^T}{s^T g} \tag{4.24}$$

where $s = w^{t+1} - w^t$, $g = \nabla F(w^{t+1}) - \nabla F(w^t)$, and $V = I - g s^T / (s^T g)$. At each
iteration $H^{-1}$ can be determined through the two new vectors $s$ and $g$, and the
previous $H^{-1}$. Hence the method is very efficient.
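The inverse update can be written in a few lines. The sketch below assumes the standard BFGS form $H^{-1} \leftarrow V^T H^{-1} V + s s^T / (s^T g)$ with $V = I - g s^T / (s^T g)$, which is how Equation 4.24 is read here; the function name and the numerical values of `s` and `g` are made up for illustration.

```python
def bfgs_inverse_update(H_inv, s, g):
    """Return the updated inverse-Hessian approximation (standard BFGS form)."""
    n = len(s)
    sg = sum(si * gi for si, gi in zip(s, g))            # s^T g, assumed > 0
    # V = I - g s^T / (s^T g)
    V = [[(1.0 if i == j else 0.0) - g[i] * s[j] / sg for j in range(n)]
         for i in range(n)]
    # HV = H_inv * V
    HV = [[sum(H_inv[i][k] * V[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # V^T * HV + s s^T / (s^T g)
    return [[sum(V[k][i] * HV[k][j] for k in range(n)) + s[i] * s[j] / sg
             for j in range(n)]
            for i in range(n)]


H0 = [[1.0, 0.0], [0.0, 1.0]]      # initial guess: identity
s = [1.0, 0.5]                     # assumed weight change
g = [3.5, 2.0]                     # assumed gradient change
H1 = bfgs_inverse_update(H0, s, g)

# Secant condition: the updated matrix maps g back onto s.
mapped = [H1[0][0] * g[0] + H1[0][1] * g[1],
          H1[1][0] * g[0] + H1[1][1] * g[1]]
```

The update only needs matrix-vector style products with the previous approximation, so no matrix is inverted, but the full $S \times S$ array is still kept in memory.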

Watrous (1987) compared the BFGS method with standard BP and the Davidon-Fletcher-Powell
algorithm, and reported that BFGS trained the neural nets in significantly
fewer iterations than the other two methods.

A criticism of the BFGS algorithm is that it still requires the storage of the approximation
to the Hessian matrix, which is inefficient for large problems. It also loses
the computational locality of backpropagation, where the weight updating
can be carried out in local units.

Becker and le Cun (1988) proposed using a simple diagonal approximation to
the Hessian matrix. They replace $\Delta w_{ij} = -\eta \frac{\partial F}{\partial w_{ij}}$ with what they called a
"Pseudo-Newton Step"

$$\Delta w_{ij} = -\eta\, \frac{\dfrac{\partial F}{\partial w_{ij}}}{\left|\dfrac{\partial^2 F}{\partial w_{ij}^2}\right| + \mu} \tag{4.25}$$

where $\mu \in R^+$ is used to improve the conditioning of the Hessian matrix. The
magnitude of $\mu$ determines how much curvature information is to be used in the
weight updating rule.

4.3.3     Quickpropagation 

Most  second  order  methods  are  considerably  more  difficult  to  implement  than  first 
order  methods,  especially  those  that  require  global  information.  Fahlman  (1988) 
developed  a  heuristic  algorithm  he  called  quickpropagation  (quickprop  for  short) 
based  on  two  assumptions:  (1)  the  error  (i.e.,  the  criterion  function)  surface  in  weight 
space  can  be  approximated  by  a  parabola,  and  (2)  the  change  in  the  slope  of  the  error 
surface  in  one  weight  axis  is  not  affected  by  other  weights  that  are  changing  at  the 
same time. Thus each weight can be updated independently by using the previous and
current error slopes and the previous weight change:

$$\Delta w(t) = \frac{\dfrac{\partial F(t)}{\partial w}}{\dfrac{\partial F(t-1)}{\partial w} - \dfrac{\partial F(t)}{\partial w}}\; \Delta w(t-1). \tag{4.26}$$

This weight change leads directly to the minimum point of the parabola.⁴ Thus the
quickprop method would converge very fast if the criterion function surface were nearly
quadratic.

Although the assumptions are very crude, the quickprop algorithm turned out to
be very effective in reducing neural net training time in many standard test problems,
including the parity problem and the encoding problem. The good performance of
quickprop can, we think, be explained from the weight updating formula. Compared
to standard BP, the quickprop weight updating rule has a denominator
$\frac{\partial F(t-1)}{\partial w} - \frac{\partial F(t)}{\partial w}$.
This factor is relatively large when the weight gradient changes a lot, which
results in a small stepsize. In flat areas of the error surface, the gradient changes
very little, which creates a large stepsize. This effectively overcomes the problems
with the fixed stepsize of the standard BP method.

⁴ The formula can be derived as follows: Assume $F = aw + bw^2$, and let $f = F'$. Then
$f = a + 2bw$, $f(i) = a + 2bw(i)$, for $i = t-1, t, t+1$. Solving the three equations with $f(t+1)$ set
to zero gives the result.
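The derivation can be checked on an exact parabola. In the sketch below the coefficients `a` and `b` of $F(w) = aw + bw^2$, the two starting weights, and the function names are illustrative assumptions; the step itself is the quickprop formula for a single weight.

```python
def quickprop_step(slope, prev_slope, prev_dw):
    """Quickprop weight change for one weight from two successive slopes."""
    return slope / (prev_slope - slope) * prev_dw


# Assumed parabola F(w) = a*w + b*w**2 with slope f(w) = a + 2*b*w.
a, b = -4.0, 1.0


def slope(w):
    return a + 2.0 * b * w


w_prev, w_now = 0.0, 1.0                     # two arbitrary successive weights
dw_prev = w_now - w_prev
dw = quickprop_step(slope(w_now), slope(w_prev), dw_prev)
w_next = w_now + dw                          # minimum of the parabola: -a / (2b)
```

Because the secant through the two slopes is exact for a parabola, `w_next` coincides with the minimizer $-a/(2b)$ after a single step.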

4.4     Parameter  Adjusting 

Tollenaere (1990) conducted a series of experiments to investigate the effect of
the learning parameters $(\eta, \alpha)$ on the learning speed (measured in epochs). Those
experiments cleared up, to some extent, the confusion about how to choose the parameters
that was caused by conflicting reports based only on non-systematic studies.
Some general conclusions from Tollenaere's study can be summarized as follows.

•  Learning time decreases exponentially as $\eta$ increases, up to a certain point.
After that point, the iterative process becomes unstable.

•  The optimal learning rate $\eta$ (with which the learning time is the least) decreases
as the momentum $\alpha$ increases from 0 to 1.

•  The use of momentum usually increases the learning speed by a factor of 2 to 3.

It has long been realized that part of standard BP's low efficiency is due
to its fixed parameters. Usually the parameters need to be chosen empirically for
a particular problem. Even after the best parameter combination is found through
extensive experiments, fixed parameters cannot meet conflicting needs; e.g., a large
stepsize is desired in flat areas of the criterion surface, while a small
stepsize is required in areas with narrow ravines.

Numerous dynamic parameter adjusting schemes have been developed. Most of
them are heuristics (e.g., Silva and Almeida, 1990) and emphasize local computation
(as compared to the second order methods in the last section, where nonlocal
computation is usually required). Local processes are necessary for parallel implementation,
and they more closely resemble what happens in the biological neural system
(Jacobs, 1988).
Several principles for adjusting parameters are given in Jacobs (1988):

1.  An  individual  learning  step  should  be  assigned  to  each  weight  (and  threshold). 

2.  The  learning  rate  (stepsize)  should  be  adjusted  according  to  the  curvature  of 
the  criterion  function  where  change  is  taking  place. 

3.  The  learning  rate  should  be  increased  when  the  current  partial  derivative  of 
the  criterion  function  with  respect  to  the  weight  in  consideration  has  the  same 
sign  as  the  previous  partial  derivative;  otherwise,  the  learning  rate  should  be 
decreased. 

Based on these principles, Jacobs proposed the "delta-bar-delta" (DBD) learning
rule. A learning rate $\eta_{ij}$ is allocated to each weight $w_{ij}$, and $\bar{\delta}_{ij}$ is introduced as an
exponentially decaying trace of the gradient. The formulae for weight updating are:

$$\Delta\eta_{ij}(t) = \begin{cases}
\kappa & \text{if } \bar{\delta}_{ij}(t-1)\,\delta_{ij}(t) > 0 \\
-\lambda\, \eta_{ij}(t-1) & \text{if } \bar{\delta}_{ij}(t-1)\,\delta_{ij}(t) < 0 \\
0 & \text{otherwise}
\end{cases} \tag{4.27}$$

$$\Delta w_{ij}(t) = -\eta_{ij}\, \delta_{ij}(t) + \alpha\, \Delta w_{ij}(t-1) \tag{4.28}$$

$$\bar{\delta}_{ij}(t) = (1-\theta)\, \delta_{ij}(t) + \theta\, \bar{\delta}_{ij}(t-1) \tag{4.29}$$

where $\kappa$, $\lambda$, and $\theta$ are user determined parameters, and $\delta_{ij} = \frac{\partial F}{\partial w_{ij}}$ (it is slightly
different from the $\delta_{ij}$ in standard BP). Note that the increase in the learning rate
is additive while the decrease is multiplicative. This strategy prevents the learning
rate from growing too fast (which may lead to weight saturation) and allows it to
decrease rapidly while remaining positive.
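A minimal sketch of the learning-rate part of the rule for a single weight follows. It assumes the additive-increase, multiplicative-decrease reading described above, with hypothetical parameter names `kappa`, `lam`, and `theta`; the momentum term of the weight update is left out to keep the example short.

```python
def dbd_update(eta, delta, delta_bar_prev, kappa, lam, theta):
    """Return (new learning rate, new gradient trace) for one weight."""
    agreement = delta_bar_prev * delta
    if agreement > 0:
        eta = eta + kappa            # additive increase
    elif agreement < 0:
        eta = eta - lam * eta        # multiplicative decrease, stays positive for 0 < lam < 1
    delta_bar = (1.0 - theta) * delta + theta * delta_bar_prev
    return eta, delta_bar


# Two successive gradients for one weight (made-up values).
eta1, bar1 = dbd_update(eta=0.1, delta=0.5, delta_bar_prev=0.2,
                        kappa=0.01, lam=0.5, theta=0.7)
eta2, bar2 = dbd_update(eta=eta1, delta=-0.4, delta_bar_prev=bar1,
                        kappa=0.01, lam=0.5, theta=0.7)
```

In the first call the current gradient agrees in sign with the trace, so the rate grows by `kappa`; in the second call the sign flips and the rate is halved.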

The DBD algorithm leads to a significant speed-up over the standard BP algorithm.
However, the algorithm is very sensitive to the new parameters, especially $\kappa$. Also,
while the momentum term increases learning speed, it leads to instability. Minai and
Williams (1990) proposed several modifications to the DBD algorithm, and labeled
the new version the "Extended Delta-Bar-Delta" (EDBD) algorithm. The main ideas
of EDBD include the following. (1) An exponential function of $-|\bar{\delta}(t)|$, instead of a
constant, is used to scale up the learning rate. This allows a faster increase of
$\eta$ in flat areas than in areas with large gradient. (2) The momentum is also made
adaptive. (3) Upper bounds are put on both $\eta$ and $\alpha$. The new weight updating
rules become:

$$\Delta w_{ij}(t) = -\eta_{ij}(t) \frac{\partial F(t)}{\partial w_{ij}} + \alpha_{ij}(t)\, \Delta w_{ij}(t-1) \tag{4.30}$$

$$\eta_{ij}(t+1) = \min\{\eta_{max},\; \eta_{ij}(t) + \Delta\eta_{ij}(t)\} \tag{4.31}$$

$$\alpha_{ij}(t+1) = \min\{\alpha_{max},\; \alpha_{ij}(t) + \Delta\alpha_{ij}(t)\} \tag{4.32}$$

$$\Delta\eta_{ij}(t) = \begin{cases}
\kappa_l\, e^{-\gamma_l |\bar{\delta}_{ij}(t)|} & \text{if } \bar{\delta}_{ij}(t-1)\,\delta_{ij}(t) > 0 \\
-\lambda_l\, \eta_{ij}(t) & \text{if } \bar{\delta}_{ij}(t-1)\,\delta_{ij}(t) < 0 \\
0 & \text{otherwise}
\end{cases} \tag{4.33}$$

$$\Delta\alpha_{ij}(t) = \begin{cases}
\kappa_m\, e^{-\gamma_m |\bar{\delta}_{ij}(t)|} & \text{if } \bar{\delta}_{ij}(t-1)\,\delta_{ij}(t) > 0 \\
-\lambda_m\, \alpha_{ij}(t) & \text{if } \bar{\delta}_{ij}(t-1)\,\delta_{ij}(t) < 0 \\
0 & \text{otherwise}
\end{cases} \tag{4.34}$$

where $\kappa_l$, $\lambda_l$, $\gamma_l$, $\kappa_m$, $\lambda_m$, $\gamma_m$, $\eta_{max}$ and $\alpha_{max}$ are parameters furnished by the user. EDBD
was reported to provide significant speed-up over DBD and to be more robust on
learning the logistic function $f(x) = ax(1-x)$, $a = 3.95$, $0 < x < 1$.

The authors of EDBD also suggested implementing a memory and recovery mechanism
in the learning algorithm. Specifically, the current best solution is retained.
A control parameter $\phi \in R$, $\phi > 0$ is defined. If the criterion function value becomes
greater than $\phi$ times the best criterion value retained so far, then the search is
abandoned and restarted from the current best point with attenuated learning rate
and momentum. However, the experiments on this idea showed somewhat negative
results.

Davos and Orban's (1988) SAB (self-adapting backpropagation) algorithm advocates
similar ideas. The algorithm starts without momentum, and increases the
learning rate exponentially as long as the weight gradient keeps the same sign. It differs
from the EDBD algorithm in that when the weight gradient changes sign, instead
of reducing $\eta_{ij}$ by some rule, it is reset to its starting value, and then the algorithm
seeks optimum weights using ideas similar to those of the quickprop method. This
latter stage is accompanied by the use of momentum, and continues for a few steps
before the search without momentum is resumed.
Tollenaere (1990) modified the SAB method and named his version SuperSAB.
The motivation behind SuperSAB is that whenever the gradient changes sign, the
weights should not be changed. The weight change halts until the stepsize is reduced
to such an extent that a step can be taken without changing the sign of the gradient.
The learning rate changes simply by

$$\eta_{ij}(t+1) = \eta^{+}\, \eta_{ij}(t) \tag{4.35}$$

and

$$\eta_{ij}(t+1) = \eta_{ij}(t) / \eta^{-} \tag{4.36}$$

where $\eta^{+}$ and $\eta^{-}$ are the increase factor and the decrease factor, respectively.
Tollenaere reported that SuperSAB is insensitive to the parameters, and that $\eta^{+} = 1.05$ and
$\eta^{-} = 2$ were shown to be good for a wide variety of problems.
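The following one-dimensional sketch illustrates the idea of growing the stepsize while the gradient keeps its sign and shrinking it, without accepting the step, when the sign flips. It is a simplified reading of the scheme, not Tollenaere's implementation: the decrease is assumed to divide the rate by `eta_minus`, a rejected step is retried from the last accepted weight, and the test function $F(w) = \frac{1}{2} w^2$ and all names are illustrative.

```python
def supersab_1d(grad_fn, w, eta, eta_plus=1.05, eta_minus=2.0, steps=200):
    """Adaptive-stepsize descent on one weight; returns (weight, stepsize)."""
    prev_w, prev_grad = w, grad_fn(w)
    w = prev_w - eta * prev_grad               # first trial step
    for _ in range(steps):
        grad = grad_fn(w)
        if grad * prev_grad >= 0:
            # Same sign: accept the trial point and enlarge the stepsize.
            eta *= eta_plus
            prev_w, prev_grad = w, grad
        else:
            # Sign change: reject the trial point and shrink the stepsize.
            eta /= eta_minus
        w = prev_w - eta * prev_grad           # next trial step
    return w, eta


def quadratic_grad(w):
    """Gradient of the assumed test function F(w) = 0.5 * w**2."""
    return w


w_small, eta_small = supersab_1d(quadratic_grad, w=1.0, eta=0.01)
w_large, eta_large = supersab_1d(quadratic_grad, w=1.0, eta=5.0)
```

Both a very small and a very large initial stepsize lead to the minimum here, which is in the spirit of the wide optimal stepsize region reported for SuperSAB.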

Compared with the standard BP method, SuperSAB learning is significantly faster.
One important feature of SuperSAB is that the range of the initial stepsize that leads to
reasonably fast learning (Tollenaere referred to it as osr, the optimal stepsize region)
is orders of magnitude wider than that of standard BP. A drawback of SuperSAB
is that it is slightly more unstable than BP. But it was argued that SuperSAB with
restart after divergence was detected still outperformed standard BP.

An interesting and important observation Tollenaere made is that the optimum
stepsize regions of different learning algorithms do not necessarily overlap. Thus,
comparisons of different algorithms based on the same parameter values are inappropriate.

An  idea  similar  to  SuperSAB  was  used  by  Silva  and  Almeida  (1990)  in  their 
Adaptive  Backpropagation  Algorithm  (ABA).  However,  Silva  and  Almeida  studied 
the  effectiveness  of  the  algorithm  in  the  context  of  varying  criterion  surface  orientation 
in  the  input  space.  They  argued  that  because  an  individual  learning  rate  is  used  for 
each  weight,  the  performance  of  the  method  may  be  affected  by  the  orientation  of  the 
criterion  surface  relative  to  the  coordinate  axes.  Indeed,  their  results  showed  that 
although  ABA  outperformed  standard  BP  in  virtually  all  test  cases,  the  speed-up 
was  orders  of  magnitude  higher  when  the  valleys  of  the  criterion  surface  are  aligned 
with  the  coordinates  of  the  input. 
Chan and Shatin (1990) used the angle $\theta(t)$ between consecutive weight gradients,
instead of the sign, to detect the curvature of the criterion surface in the weight space.
Only a global learning rate is used, and it is adapted by

$$\eta(t) = \eta(t-1)\left(1 + \tfrac{1}{2} \cos\theta(t)\right). \tag{4.37}$$

The momentum is also made adaptive in their algorithm by

$$\alpha(t) = \lambda(t)\, \eta(t) \tag{4.38}$$

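with

$$\lambda(t) = \lambda_0\, \frac{\|\nabla F(t)\|}{\|\Delta w(t-1)\|} \tag{4.39}$$

where $\lambda_0 \in (0, 1)$. This in effect attenuates the momentum term so that it never
exceeds the current gradient term, and hence does not dominate the effect of the current
weight gradient. The weight updating rule is then

$$\Delta w(t) = \eta(t)\left(-\frac{\partial F(t)}{\partial w}\right) + \alpha(t)\, \Delta w(t-1). \tag{4.40}$$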

A backtracking heuristic is also implemented. The learning rate $\eta(t)$ is reduced
by half whenever the criterion value $F(t)$ is greater than the previous one $F(t-1)$
by a certain percentage (say, 1%).
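The two learning-rate rules can be sketched as follows, reading Equation 4.37 as $\eta(t) = \eta(t-1)(1 + \frac{1}{2}\cos\theta(t))$ and using a 1% tolerance for backtracking. The function names and the sample gradient vectors are assumptions for illustration only.

```python
import math


def cos_angle(u, v):
    """Cosine of the angle between two non-zero gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adapt_learning_rate(eta_prev, grad, grad_prev):
    """Angle-based adaptation: grow when gradients align, shrink when they oppose."""
    return eta_prev * (1.0 + 0.5 * cos_angle(grad, grad_prev))


def backtrack(eta, F_now, F_prev, tolerance=0.01):
    """Halve the learning rate if the criterion rose by more than `tolerance`."""
    return eta / 2.0 if F_now > (1.0 + tolerance) * F_prev else eta


eta_aligned = adapt_learning_rate(0.2, [1.0, 0.0], [2.0, 0.0])      # cos = +1
eta_opposed = adapt_learning_rate(0.2, [1.0, 0.0], [-2.0, 0.0])     # cos = -1
eta_orthogonal = adapt_learning_rate(0.2, [1.0, 0.0], [0.0, 3.0])   # cos = 0
```

Unlike a sign test on each weight, the angle gives a graded response: the rate grows by up to 50% per step when consecutive gradients point the same way and shrinks by up to 50% when they oppose each other.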

Chan and Shatin's Adaptive Training Algorithm (ATA) was tested against the
Delta-Bar-Delta algorithm and a conjugate gradient algorithm on the XOR problem
and the 4-2-4 encoding problem (discussed in Chapter 5; see also Rumelhart
et al., 1986). ATA was shown to learn much faster than the other two algorithms
and was insensitive to initial parameters (although it still suffered from the local minimum
problem, as the others did).

4.5     Activation  Functions 
Activation  functions  play  an  important  role  in  neural  net  learning.    The  most 
commonly  used  activation  function  is  the  sigmoid  function.  The  following  discusses 
some  variations. 
4.5.1     Radial  Basis  Functions

Powell (1985) introduced the radial basis function (RBF) for multivariate interpolation
problems. Learning in supervised feedforward neural nets can be viewed
as surface interpolation. This observation led to the use of radial basis functions as
the activation functions in neural nets by Broomhead and Lowe (1988), Moody and
Darken (1989), and Poggio and Girosi (1990).

Standard feedforward neural networks use sigmoid activation functions. The net
input ($\sum_i w_{ij} x_i$) to each processing unit forms a hyperplane. Multilayer perceptrons
partition the input space with the hyperplanes from each unit, while in a feedforward
neural net those hyperplanes are smoothed through the sigmoid nonlinear filter before
being used to form a decision region (partition). The radial basis function forms
hyperellipsoid regions in the input space. An RBF network consists of two layers (see
Figure 4.1). Each hidden unit has a radial basis function $\phi : R^+ \to R$ defined by

$$\phi_i(x) = \phi(\|x - \mu_i\|), \qquad i = 1, 2, \ldots, N \tag{4.41}$$

where $\mu_i$, $i = 1, 2, \ldots, N$ are parameters, and $N$ is the number of radial basis
centers. $\|x - \mu_i\|$ measures the distance from the input vector $x$ to the radial basis
function center $\mu_i$. The network output at node $k$ is

$$f_k(w, x) = \sum_{i=1}^{N} w_{ik}\, \phi_i(\|x - \mu_i\|). \tag{4.42}$$

A frequently used radial basis function is the Gaussian function

$$\phi_i(x) = e^{-\|x - \mu_i\|^2} \tag{4.43}$$

where

$$\|x - \mu_i\|^2 = (x - \mu_i)^T \Sigma_i\, (x - \mu_i). \tag{4.44}$$

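To simplify computation, the covariance matrix $\Sigma_i$ is usually chosen to be a diagonal
matrix

$$\Sigma_i = \begin{pmatrix}
\sigma_{i1} & 0 & \cdots & 0 \\
0 & \sigma_{i2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_{iN}
\end{pmatrix} \tag{4.45}$$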

Figure 4.1. A 3 x 4 x 2 radial basis function network


When the radial basis centers $\mu_i$, $i = 1, 2, \ldots, N$ are fixed at the data points $x^i$, $i =
1, 2, \ldots, N$, what is left for the network to learn is only the linear coefficients
$w_{ik}$ in the output layer. In this case, RBF networks can be trained very fast and
without suffering from the problem of local minima. Moody and Darken (1989) reported
that their RBF network reduced training time on learning the Mackey-Glass equation⁵ by
a factor of $10^2$ to $10^3$ compared with standard BP.
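When the centers are pinned to the data points, fitting the output weights reduces to solving one linear system. The sketch below does this for a scalar input with an isotropic Gaussian basis of unit width; the data values, the width, and the small Gaussian-elimination helper are illustrative assumptions, not the setup of Moody and Darken.

```python
import math


def gaussian_rbf(x, center, width=1.0):
    """Gaussian radial basis function of a scalar input."""
    return math.exp(-((x - center) / width) ** 2)


def solve_linear(M, y):
    """Solve M w = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    A = [row[:] + [yi] for row, yi in zip(M, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (A[i][n] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w


# Made-up training data; one center per data point.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.8, 0.9, 0.1]
Phi = [[gaussian_rbf(x, c) for c in xs] for x in xs]   # hidden-layer outputs
weights = solve_linear(Phi, ys)                        # output-layer weights


def rbf_output(x):
    return sum(w * gaussian_rbf(x, c) for w, c in zip(weights, xs))
```

No iterative search is involved, which is why this configuration has no local minima; the price is one hidden unit per training point.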

However, the RBF network is not appropriate for large data sets, as the size of the
network grows with the number of training instances. Poggio and Girosi (1989) proposed
treating the radial basis centers as variables, so that the neural net is allowed to estimate
the centers $\mu_j$, $j = 1, 2, \ldots, K$, where $K$ may be much less than $N$ (the number of
data points). They called the extension the Generalized Radial Basis Function (GRBF)
network. A very rigorous and thorough treatment of RBF and GRBF networks is
given in Poggio and Girosi (1989).


⁵ The Mackey-Glass equation is defined by $\frac{dx(t)}{dt} = \frac{a\, x(t-\tau)}{1 + x^{10}(t-\tau)} - b\, x(t)$. It is highly chaotic for
some values of the constants $a$, $b$ and $\tau$.
4.5.2     Transcendental  Functions

Although the sigmoid and the hyperbolic tangent functions have been the most
frequently used activation functions in feedforward neural nets, other monotonic,
differentiable functions can also be used (Cybenko, 1989). In particular, we have tested
transcendental functions, such as the sine or cosine function, as the activation
function. The XOR problem can be solved in a few iterations with the new activation
function. Rosen et al. (1990) reported that their neural nets using sine and/or cosine
activation functions outperformed the standard BP on the parity problem ($n < 4$)
and on learning the $x^2$ and $x^3$ functions. A justification suggested by Rosen et al. is that
transcendental functions can be expanded (via Taylor-series expansion) as the sum
of infinite order polynomials. Although the polynomials are not independent within
each activation function, in a multilayer network the weighted sum of outputs from
the hidden units in effect produces a weighted sum of infinite order polynomials. But
the sigmoid function can also be expanded as a sum of polynomials. Experiments by
Lapedes and Farber (1987) showed that trigonometric activation functions are less
robust than the sigmoid function.

4.5.3     Higher  Order  Networks  and  Function-link  Networks

Instead of using the sum of weighted inputs as net input, some researchers (Pineda,
1988) have explored the use of net input with higher order correlations among the inputs
(e.g., higher order links may be created that take the product of input variables
as input). The correlations are usually captured by the cross terms of a polynomial.
Volper and Hampson used quadratic terms, in particular, and concluded that higher
order networks can be trained noticeably faster than the standard network. Durbin
and Rumelhart (1989) studied net input using product forms, and called those processing
units product units. Their conclusion was that product units could be a
computationally powerful extension to the standard network.

Pao (1989) generalized the idea of higher order links to that of functional links,
where new inputs are created that represent functionals (a special case being the
polynomial) of the original input pattern. The newly generated inputs are fed into
the network together with the original inputs. Since the functionals can be arbitrarily
complex, this creates a powerful method that usually permits simple networks without
hidden layers to solve hard problems. A function-link network that solves the parity
3 problem is shown in Figure 4.2.

Figure 4.2. A function-link neural network used to solve Parity 3

This functional network outperformed a 3 x 3 x 1 standard feedforward neural
network by nearly an order of magnitude. The efficacy of function-link neural nets
was also shown through learning functions of one and two variables.
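To illustrate why functional links can remove the need for a hidden layer, the sketch below trains a single threshold unit on parity 3 after augmenting the inputs with their products. The particular feature set (all pairwise products and the triple product of inputs coded as -1/+1), the perceptron learning rule, and all names are assumptions; they are not necessarily the links used in Figure 4.2.

```python
from itertools import product


def functional_links(x):
    """Original inputs plus product terms (an assumed set of functional links)."""
    x1, x2, x3 = x
    return [x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3]


def parity_target(x):
    """+1 if an odd number of inputs are +1, else -1 (inputs coded as -1/+1)."""
    return 1 if sum(1 for v in x if v == 1) % 2 == 1 else -1


patterns = list(product([-1, 1], repeat=3))
weights = [0.0] * 7

# Perceptron rule on the augmented inputs; there is no hidden layer.
for epoch in range(20):
    mistakes = 0
    for x in patterns:
        features = functional_links(x)
        target = parity_target(x)
        activation = sum(w * f for w, f in zip(weights, features))
        if target * activation <= 0:
            weights = [w + target * f for w, f in zip(weights, features)]
            mistakes += 1
    if mistakes == 0:
        break


def predict(x):
    activation = sum(w * f for w, f in zip(weights, functional_links(x)))
    return 1 if activation > 0 else -1
```

With this coding the parity target equals the triple product of the inputs, so the problem becomes linearly separable in the augmented space and the single unit learns it in a handful of passes.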

4.5.4     Gradient  Descent  Search  in  Function  Space 

Instead of using fixed activation functions in the processing units, Mani (1990)
considered providing a pool of functions to the processing units and letting the learning
algorithm decide which of the candidate activation functions are the best to use.
(Different function pools may be provided to different processing units.) The learning
procedure he proposed is similar to that of the standard BP, but now the gradient
descent is applied in the function space, rather than the weight space (though the
two might be combined, as suggested by Mani).

Unfortunately, the order of a set of general functions cannot be readily defined,
hence the function gradients are not easily obtained. This difficulty makes the approach
more conceptual than practical. The only problem the author attempted to
solve is the XOR problem, which was solved with an activation function pool made
of ordered boolean functions (on, and, or, and off) (Mani, 1990).
4.6     Dynamically  Constructed  Neural  Nets

The  algorithms  we  have  discussed  so  far  apply  only  to  neural  nets  with  fixed 
structures.  That  is,  the  number  of  hidden  processing  units,  the  connections  between 
the  units,  and  the  layout  of  the  network  are  determined  before  the  training  algorithms 
are applied. Many researchers have realized that there are drawbacks with fixed
neural net structures (see Honavar and Uhr, 1988; Tenorio and Lee, 1989; Frean,
1990). For any particular problem we want to solve, some neural net structures are
more appropriate than others. Since there are no general guidelines as to how a neural
net should be designed for a given problem, it has been common practice for neural
net users to copy a neural net structure from other applications (without questioning
its validity), or simply to make one up arbitrarily. This is hardly a scientific approach,
even though success may have been claimed.

Generally,  small  neural  nets  are  preferred,  given  that  they  are  large  enough  to  be 
capable  of  solving  the  problem  at  hand.  The  rationales  are  that:  (1)  parsimony  is 
always  desirable;  (2)  neural  nets  with  fewer  parameters  are  easier  to  interpret,  when 
interpretation is necessary; (3) small-sized networks can be trained more reliably given
a fixed-size training sample (see, e.g., Haussler, 1991); and (4) neural nets with fewer
hidden  units  seem  to  generalize  better  with  novel  patterns  (Kruschke  and  Movellan, 
1991).  Although  the  general  representation  theorem  (see  Chapter  3)  guarantees  that 
a  feedforward  neural  network  with  a  single  hidden  layer  is  sufficient  for  learning 
practically  any  input-output  mapping,  there  is  no  theoretical  result  yet  that  specifies 
how  many  hidden  units  are  needed. 

Honavar and Uhr (1988) pointed out that it is desirable to restrict the fan-in and
fan-out^6 sizes to create local receptive fields. The number of hidden units in
each layer is then limited, and multiple hidden layers become necessary to learn a desired
mapping. Indeed, experiments conducted by Gorman and Sejnowski (1988) suggested
that it is easier to train a neural net with multiple hidden layers than a neural net
with many hidden units in a single hidden layer. This raises a new question: how
many hidden layers should be used?

^6 Fan-in is the number of connection links leading to a processing unit; similarly, fan-out is the
number of connection links leaving a unit.

Two broad approaches have been employed to construct neural nets of optimal
(or appropriate) size. The first is to start with a small network and let it grow
as needed. The second is to train a network of excessively large (estimated) size,
and then prune away units that do not have a significant impact on network
performance.

4.6.1     Network  Growing  Methods 

Fahlman  and  Lebiere  (1990)  identified  two  major  problems  that  contribute  to 
the  inefficiency  of  the  standard  BP:  the  step-size  problem  and  the  moving  target 
problem.  The  first  problem  has  been  covered  in  a  previous  section.  It  is  the  second 
problem  that  is  caused  by  the  fixed  structure  of  a  neural  net.  In  such  a  network 
the  hidden  units  have  no  communication  with  one  another,  as  no  lateral  connections 
are  provided.  During  the  training  process,  each  hidden  unit  modifies  its  link  weights 
according  to  the  error  signal  backpropagated  from  the  output  layer.  The  problem  is 
that all units are trying to learn the same training pattern at the same time. As the
training pattern changes constantly (under instance-by-instance training, the most common case),
it takes a long time for the hidden units to split their roles and commit to different
patterns.

A  possible  way  to  combat  the  moving  target  effect  is  to  train  part  of  the  network 
at  a  time.  The  cascade-correlation  algorithm  developed  by  Fahlman  and  Lebiere  uses 
this  approach  to  its  extreme.  Only  one  hidden  unit  (including  associated  weights  and 
bias)  is  allowed  to  change  at  any  stage  of  the  training  process. 

The cascade-correlation algorithm starts with a feedforward neural network without
a hidden layer. The algorithm builds up the network (the cascade architecture)
by adding hidden units one at a time. Whenever a hidden unit is added, it forms a
new hidden layer with connections from all input units and all previously added hidden
units.

The decision to add a new hidden unit is made when training proceeds without
significant error reduction and the error is still greater than the stopping criterion.
Before a new hidden unit is added to the net, it is connected to all input and previous
hidden units. The output of the candidate hidden unit is computed over all training
patterns, and the covariance S between the hidden unit output $V_p$ and the current network
error is maximized. S is given by

$$S = \sum_{k} \left| \sum_{p} (V_p - \bar{V})(E_{pk} - \bar{E}_k) \right| \tag{4.46}$$

where k is the output unit index, p is the pattern index, $E_{pk}$ is the error at output unit k
for pattern p, and $\bar{V}$ and $\bar{E}_k$ are averages over all patterns.
The  weights  leading  to  the  candidate  hidden  unit  are  modified  to  maximize  S  with 
a  gradient  ascent  algorithm  similar  to  that  of  backpropagation.  When  these  weights 
converge  (the  maximization  problem  is  solved),  they  are  frozen,  and  added  to  the 
current  net  with  the  candidate  hidden  unit.  Then  the  training  of  the  net  resumes 
until the stopping criterion is met or new hidden units are needed.^7
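The quantity in Equation (4.46) can be illustrated with a short sketch. The function name `candidate_score` and the list-based data layout are assumptions made here for illustration; they do not come from Fahlman and Lebiere's implementation, and the gradient ascent step on the candidate's weights is not shown.

```python
def candidate_score(v, e):
    """Return S = sum_k | sum_p (v[p] - mean(v)) * (e[p][k] - mean_p(e[.][k])) |.

    v: outputs of the candidate hidden unit, one value per training pattern p.
    e: residual network errors, e[p][k] for pattern p and output unit k.
    """
    n_patterns = len(v)
    n_outputs = len(e[0])
    v_bar = sum(v) / n_patterns
    score = 0.0
    for k in range(n_outputs):
        # Average error of output unit k over all patterns.
        e_bar = sum(e[p][k] for p in range(n_patterns)) / n_patterns
        cov = sum((v[p] - v_bar) * (e[p][k] - e_bar) for p in range(n_patterns))
        score += abs(cov)
    return score
```

A candidate whose output follows the residual error pattern by pattern receives a large score, while a candidate whose output is unrelated to the error receives a score near zero.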

Fahlman and Lebiere (1990) ran a number of benchmark tests.
They reported that the cascade-correlation algorithm beat quickprop^8 by
a factor of 5 and standard BP by a factor of 10 on the two-spirals problem.^9 On
the  8-bit  parity  problem,  the  cascade-correlation  algorithm  not  only  outperformed 
the  standard  BP  by  a  factor  of  5,  but  it  also  built  a  much  more  compact  network. 
Furthermore,  it  was  shown  to  generalize  well  on  the  10-bit  parity  problem. 

Frean (1990) developed an interesting net growing algorithm, the Upstart algorithm.
The algorithm deals with multilayer perceptrons, i.e., feedforward neural nets
with threshold processing units. It creates new units, called daughters, that correct
the errors made by each parent unit. The algorithm proceeds recursively, creating
new daughter units until none of the terminal (leaf) daughters makes any mistakes.
In other words, the Upstart algorithm expands the network until the problem
is solved. Convergence to zero error is guaranteed for learning boolean functions.
Tests on the parity problem showed that the Upstart algorithm was efficient. It
solved the n-bit parity problem with n less than 10 in fewer than 1000 iterations. The
algorithm probably does not scale up well, however, since it took more than 10,000 iterations to
solve the 10-bit parity problem.

^7 At first glance, this approach seems anti-connectionist. But we need to realize that sequential
processing may constitute an important part even of a connectionist system. The success of the
cascade-correlation algorithm provides strong evidence for that argument.

^8 Quickprop is also implemented within the cascade-correlation algorithm.

^9 The two-spirals problem has two real-valued inputs and a single binary output that identifies
the classification of the inputs. The inputs are points on two interlocking spirals that go around the
origin. The task is to tell the two spirals apart through learning from the examples in the training
set. This is a very hard problem because the two classes are intertwined.

The  SONN  (Self  Organizing  Neural  Net)  algorithm  proposed  by  Tenorio  and  Lee 
(1989)  was  designed  for  system  identification  problems.  A  new  node  is  generated  with 
polynomial  activation  functions  of  all  inputs  and  outputs  from  previous  layers.  The 
polynomial  is  limited  to  order  two.  Thus  each  new  unit  has  at  most  two  parent  units. 
The best polynomial function is determined by a Structure Estimation Criterion
(SEC)  which  provides  a  trade-off  between  performance  and  complexity  of  the  model. 
Simulated  annealing  is  used  in  the  search  process.  When  applied  to  learn  the  Mackey- 
Glass  (see  footnote  on  page  58)  time  series,  the  SONN  algorithm  produced  far  more 
compact  models  (net  structures)  than  the  standard  feedforward  neural  networks  used 
for  comparable  performance. 

Hirose et al. (1991) considered some heuristics that perform both growing and
pruning of feedforward neural nets. The performance criterion F (in this case, the
sum of squared errors) is checked every 100 weight updates. If F fails to decrease
by more than one percent of the previously checked value, a new unit is added to the
hidden layer. Once a network has been trained successfully, the pruning process is invoked,
which simply removes one hidden unit at a time and then restarts training of the
reduced network, until no more hidden units can be removed. This occurs when the
net fails to converge with a unit removed. These heuristics appear very crude, but
they do help to overcome the non-convergence problem. The authors even claimed
that their heuristics could avoid local minimum solutions.
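The growth test described above reduces to a one-line comparison. The following is a minimal sketch; the function name `should_add_hidden_unit` and its parameter names are hypothetical, and the surrounding training loop (which would call it every 100 weight updates) is omitted.

```python
def should_add_hidden_unit(previous_f, current_f, min_relative_decrease=0.01):
    """Return True when F has failed to decrease by more than one percent
    of its previously checked value, signalling that a unit should be added.

    previous_f: sum of squared errors at the previous check.
    current_f:  sum of squared errors at the current check.
    """
    return (previous_f - current_f) <= min_relative_decrease * previous_f
```

For example, a drop in F from 1.0 to 0.995 would trigger growth, while a drop from 1.0 to 0.9 would not.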

4.6.2     Network  Pruning 

The network growing methods usually aim to minimize the net size. However,
there are also reasons to train a neural network of larger than minimum size.
Extra hidden units may increase robustness (performing well in noisy environments;
see, e.g., Sietsma and Dow, 1991) and improve fault tolerance. Furthermore,
a minimum net is usually difficult to train, although it is theoretically sufficient to
learn the task. The chance of getting stuck in local minima is greatly reduced by
using extra hidden units (Rumelhart et al., 1986; Sietsma and Dow, 1987).

Thus many researchers have studied pruning nets after they have been trained with a
sufficiently large number of hidden units (Mozer and Smolensky, 1989; Karnin, 1990).
Sietsma and Dow (1987) proposed a two-stage pruning method. In the first stage,
the outputs of the hidden units of a trained net are analyzed. Those hidden units
whose outputs do not change across all input patterns are removed. If two hidden units
have the same or opposite outputs across all training patterns, then one of them
may be removed. In the second stage, the contribution of each hidden unit to the
learning task (classification) is analyzed. The redundant units and hidden layer(s)
are removed. The result is a much smaller net that can be trained quickly.
Interestingly, a net of the same size as the net obtained from pruning
could not be trained starting from random weights. Karnin (1990) used a similar
pruning procedure in which the hidden units are ordered by the amount the global error
(F) changes when the unit is pruned. Those units with negligible effect on the
global error are removed.
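The first pruning stage of Sietsma and Dow can be sketched as a scan over the recorded hidden unit outputs. The sketch below assumes binary (0 or 1) hidden outputs so that "opposite" has an exact meaning; the function name `redundant_units` and the data layout are illustrative assumptions. It only identifies removal candidates and does not adjust the downstream weights.

```python
def redundant_units(outputs):
    """Identify hidden units that are candidates for removal.

    outputs[p][j] is the binary output (0 or 1) of hidden unit j on pattern p.
    A unit is flagged if its output is constant over all patterns, or if it
    duplicates (or is the exact opposite of) a unit that has been kept.
    """
    n_units = len(outputs[0])
    columns = [tuple(row[j] for row in outputs) for j in range(n_units)]
    removable = []
    kept = []
    for j, col in enumerate(columns):
        if len(set(col)) == 1:
            # Output never changes across the training patterns.
            removable.append(j)
            continue
        opposite = tuple(1 - x for x in col)
        if any(columns[i] == col or columns[i] == opposite for i in kept):
            removable.append(j)
        else:
            kept.append(j)
    return removable
```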

Sankar  and  Mammone  (1991a)  proposed  a  new  neural  net  architecture  called  the 
Neural  Tree  Network  (NTN)  which  combines  feedforward  neural  nets  with  decision 
trees.   A  feedforward  neural  net  is  used  at  the  root  node  of  the  NTN  to  divide  the 
instance  space  into  N  subsets,  where  N  is  the  number  of  concept  (output)  classes.  If 
each  subset  corresponds  to  a  single  concept  class,  then  the  job  is  done.   Otherwise, 
each  of  those  subsets  with  non-unique  concept  classes  is  assigned  to  a  child  node, 
where  again  a  feedforward  neural  net  is  used  to  divide  the  subset  further.     This 
process  continues  until  each  subset  contains  only  instances  from  a  single  class.  It  has 
been  reported  that  when  feedforward  neural  nets  are  compared  with  decision  trees  for 
classification,  neural  nets  usually  give  smaller  classification  errors  but  take  a  longer 
time  to  learn  (Tsoi  and  Pearson,  1990;  Fisher  and  McKusick,  1989;  Piramuthu  et 
al., 1990). Sankar and Mammone showed that the NTN outperformed both feedforward
neural nets and decision trees on a speaker-independent vowel recognition task.

A pruning procedure for the NTN algorithm was developed and shown to be
optimal in the sense that it generates a subtree that minimizes
a Lagrangian cost function consisting of the expected misclassification and a penalty
term for the network size. When the penalty coefficient, α, is set to zero, the optimal
pruned subtree is the NTN itself. As α increases, the optimally pruned subtree
shrinks, with the root node as the limit (Sankar and Mammone, 1991b).

Weigend et al. (1990) used the information theoretic concept of "minimum
description length" (as in the SONN algorithm by Tenorio and Lee, 1989). A penalty
for network complexity, measured in number of connections, was added to the
criterion function. Thus, by minimizing the augmented criterion function through
standard BP, a trade-off is achieved between performance and network complexity.
This approach led to a smaller trained network with improved
generalization.

Similar pruning approaches were discussed in the "Skeletonization" procedure
(Mozer and Smolensky, 1989) and the "Optimal Brain Damage" method (le Cun et al.,
1990). Chauvin (1989) used a penalty term for large weights in the criterion function.
Hanson  and  Pratt  (1989)  defined  a  bias  term  in  the  criterion  function  that  served  to 
decay  the  weights  (pushing  the  weights  not  increased  by  the  updating  rule  to  zero), 
and  obtained  trained  nets  with  smaller  numbers  of  hidden  units. 
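The general effect of a weight penalty can be seen in a minimal sketch. It uses a plain quadratic penalty, F' = F + λ Σ w², which is an assumption chosen for simplicity and is not the specific penalty form used by Chauvin (1989) or by Hanson and Pratt (1989). The function and parameter names are hypothetical.

```python
def decayed_step(weights, grads, learning_rate=0.1, penalty=0.01):
    """One gradient descent step on F + penalty * sum(w^2).

    weights: current weight values.
    grads:   dF/dw for each weight, from ordinary backpropagation.
    The extra term 2 * penalty * w pulls every weight toward zero, so a
    weight that the error gradient does not reinforce decays away.
    """
    return [w - learning_rate * (g + 2.0 * penalty * w)
            for w, g in zip(weights, grads)]
```

With a zero error gradient, each call shrinks every weight by the factor 1 - 2 * learning_rate * penalty, which is the decay behavior described above.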

The  GAL  (grow  and  learn)  algorithm  introduced  by  Alpaydin  (1991)  can  both 
grow  and  prune  the  net.  It  is  basically  a  variant  of  the  nearest  neighbor  method, 
which,  instead  of  storing  the  whole  training  set,  stores  only  a  subset  of  the  training 
set with training patterns close to class boundaries. A recent summary of dynamically
structured neural nets can also be found in Alpaydin (1991).

Contrary to common belief, Sietsma and Dow (1991) showed that, for the
classification problems they attempted, pruning to the minimum number of hidden
units decreased the generalization ability of feedforward neural nets in noisy environments,
although the pruned nets did very well on the training set.

4.7     Miscellaneous  Heuristics

There are many variations of the standard BP that do not fit in the sections
presented above. We outline some of the more influential ones below.

4.7.1     Initial  Weights

In  most  nonlinear  optimization  problems,  identifying  a  good  initial  solution  could 
be  crucial  to  the  efficiency  of  the  algorithm.  Similarly,  initial  weights  in  neural  nets 
play  an  important  role  in  network  training. 

Kolen  and  Pollack  (1990)  performed  extensive  tests  on  the  sensitivity  of  back- 
propagation  to  initial  network  weights.  Their  results  showed  that  standard  BP  is 
very sensitive to the initial weight range. Specifically, for the 2 x 2 x 1 XOR net, BP
easily gets stuck in local minima when the range of initial weights is set to larger
than 2.

Chen and Bastani (1989) introduced a weight initialization algorithm for two-layer
feedforward neural nets. A least squared error (LSE) feature selection method
called the Walsh transform is used. The Walsh transform produces
an initial weight matrix that has the best projection from the training sample. The
learning  speed  of  the  XOR  network  with  the  use  of  this  weight  initialization  technique 
was  shown  to  be  much  higher  than  the  same  network  with  random  initial  weights. 
Specifically,  networks  so  initialized  performed  nearly  as  well  as  the  best  randomly 
initialized  networks  from  150  tests. 

4.7.2     Multi-scale  Training 

Felten et al. (1990) also considered incorporating features of the problem into
the neural net weight space. They reasoned that it is only natural to use any knowledge
about the training set in order to restrict the search space (hypothesis space).
Since  real  world  problems  are  inherently  structured,  it  is  possible  to  incorporate  the 
information  into  neural  network  learning.  Specifically,  they  proposed  a  multi-scale 
training  algorithm.  It  starts  with  small  networks,  and  then  uses  the  results  from  the 
trained  small  networks  to  help  train  a  larger  network.  The  networks  of  different  size 
are related through the rescaling or dilation operator. For a hand-written character
recognition task, multi-scale training learned the training set more than 5 times faster
than one-stage training and generalized better to novel characters.

4.7.3     Borderline  Patterns


Ahmad  and  Tesauro  (1988)  found  that  the  number  of  training  examples  needed  to 
train  a  neural  net  successfully  scales  linearly  with  the  number  of  inputs  for  learning 
the majority function.^10 More importantly, the most useful training instances are
those  close  to  the  class  boundary.  Thus  they  proposed  to  use  only  borderline  patterns 
to  train  the  neural  nets.  Their  experiments  showed  that  nets  trained  with  borderline 
patterns  performed  significantly  better  than  nets  trained  with  random  patterns.  They 
also  had  a  substantially  better  generalization  ability.  An  upper  bound  on  the  number 
of  random  training  patterns  sufficient  to  learn  the  majority  function  was  derived  based 
on  the  borderline  pattern  notion. 

4.7.4     Rescaling  of  Error  Signal 

Rigler et al. (1991), besides providing a general account of gradient descent
methods, noted that in a feedforward neural net with sigmoid activation functions,
the BP algorithm generates a factor $f' = o(1 - o) \le 1/4$. Hence, by the chain
rule, the gradient vectors in different layers contain exponentially decreasing factors
(1/4, 1/16, 1/64, ...). To compensate for this diminishing effect, they suggested rescaling
the gradient factor, that is, multiplying the gradient factor by exponentially increasing
scalars. One particular set of rescalings they used was 6, 36, 216, ..., obtained
by taking the inverse of the expected diminishing factors. Experiments showed
that this simple rescaling method could reduce training time by as much as an order
of magnitude.
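One way to arrive at the value 6 is to assume that the unit output o is uniformly distributed on (0, 1); the expected value of o(1 - o) is then 1/2 - 1/3 = 1/6, and its inverse is 6. The uniform assumption is made here for illustration and is not stated in the text above. The sketch below checks this numerically and builds the list of per-layer rescalings; both function names are hypothetical.

```python
def expected_sigmoid_prime(samples=100000):
    """Midpoint-rule estimate of the integral of o(1 - o) over (0, 1),
    i.e. the mean of the sigmoid derivative under a uniform output."""
    h = 1.0 / samples
    total = 0.0
    for i in range(samples):
        o = (i + 0.5) * h
        total += o * (1.0 - o)
    return total * h


def rescaling_factors(n_layers, base=6.0):
    """Rescaling for layers 1..n_layers counted back from the output:
    base, base^2, base^3, ..."""
    return [base ** depth for depth in range(1, n_layers + 1)]
```

Calling `rescaling_factors(3)` returns `[6.0, 36.0, 216.0]`, matching the sequence quoted above.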

Fahlman (1988) called $f'$ the sigmoid prime function. We have discussed that
the value of the sigmoid prime function goes to zero when the output approaches 0
or 1. This also causes the backpropagated error signal to become vanishingly small,
so learning slows down. By simply adding a constant 0.1 to the sigmoid prime
function before it is used, Fahlman reduced the training time to nearly half of that
without this modification. Even replacing the sigmoid prime function with a random
value in (0, 1) greatly reduced the training time.
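The offset modification amounts to a one-term change in the derivative used during the backward pass. A minimal sketch, with a hypothetical function name and with the offset exposed as a parameter:

```python
def sigmoid_prime(o, offset=0.0):
    """Derivative of the sigmoid expressed in terms of the unit output o.

    With offset=0.0 this is the standard o(1 - o), which vanishes as o
    approaches 0 or 1. With offset=0.1 the value never falls below 0.1,
    so a saturated unit still passes back a usable error signal.
    """
    return o * (1.0 - o) + offset
```

For a fully saturated unit (o = 1), the standard derivative is zero and no error is propagated, while the modified version still returns the offset.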

^10 The majority function is a boolean function with an odd number of binary inputs. The output
is one if more than half of the inputs are one.

4.7.5  Varying  the  Gain  Factor

Kruschke  and  Movellan  (1991)  performed  gradient  descent  with  respect  to  the  gain 
factor,  hence  making  it  adaptive.  The  adaptive  gain  factor  modifies  the  magnitude 
of  the  weight  change,  and  creates  an  effect  similar  to  that  of  an  adaptive  learning 
rate.  The  BPG  (backpropagation  with  adaptive  gain)  algorithm  was  shown  to  give  a 
remarkable speed-up (by a factor of about 2) over standard BP. The gain factor was
also  used  to  create  hidden  layer  bottlenecks  (reducing  the  number  of  hidden  units) 
for  improving  generalization. 

4.7.6  Divide  and  Conquer 

The  divide-and-conquer  strategy  has  a  long  tradition  in  computer  science  and 
artificial intelligence systems. Jacobs (1990) developed a theory and methodology
of  a  modular  connectionist  architecture.  Similarly,  Thrun  et  al.  (1991)  studied 
task  modularization  through  network  modulation.  Fox  et  al.  (1991)  proposed  a 
method  that  combines  Kohonen's  feature  map  (Kohonen,  1989)  with  the  feedforward 
neural  nets,  and  developed  an  error-driven  decomposition  scheme  that  was  shown  to 
outperform  the  feature  map  or  backpropagation  alone  in  approximating  the  Mexican 
hat function.^11

Pratt et al. (1991) studied direct transfer of learned information among neural
nets. They were able to train a large net starting with weights transferred from a
smaller net trained on subtasks. Compared with nets using random initial weights,
the weight-preset nets achieved speed-ups of up to an order of magnitude (even when
the time to train the smaller nets was taken into consideration). The decomposition
technique, borrowed from Waibel et al. (1989), includes the following steps:

1.  Subnet  training:  subnets  are  set  up  and  trained  individually. 

2.  Glue  training:    The  trained  subnets  are  bonded  together  through  additional 
"glue"  units.  The  combined  net  is  trained  with  the  subnet  weights  frozen. 

3.  Fine  tuning:  The  combined  net  is  further  trained  with  all  weights  subject  to 
change. 


^11 The Mexican hat function is given by $f(x, y) = 4(x^2 + y^2 - 2)\,e^{-(x^2 + y^2)/2}$.

4.7.7     Total  Error  vs.  Individual  Error

Some researchers, in particular Yu and Simmons (1990), considered using individual
pattern error, instead of the total sum of squared errors, to guide the learning
process. Their argument was that total error is not as effective a measure as a correctness
ratio in classification problems. They developed an algorithm called Descent
Epsilon, in which a parameter ε is used to gauge the difference between a network output
and its target value. The output is considered correct if the difference is less than ε.
Only those errors that are greater than ε are backpropagated to modify the network
weights. The magnitude of ε is gradually decreased. Hence the total error also goes
down, with individual errors kept within the ε bound.
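The error-gating step of Descent Epsilon can be sketched as a filter applied to the output errors before they are backpropagated. The function name `masked_errors` is hypothetical, and the schedule for shrinking ε is left to the caller because the text above does not specify it.

```python
def masked_errors(outputs, targets, epsilon):
    """Return the error signal to backpropagate for one pattern.

    An output whose absolute error is within epsilon counts as correct
    and contributes zero; only errors greater than epsilon are passed on.
    """
    return [t - o if abs(t - o) > epsilon else 0.0
            for o, t in zip(outputs, targets)]
```

Reducing `epsilon` between passes tightens the tolerance, so outputs that were previously treated as correct begin to contribute to the weight changes again.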

In conclusion, this chapter has summarized the state-of-the-art research in feedforward
neural network training. Most variations of the backpropagation algorithm
are  aimed  at  improving  the  training  speed  and  increasing  generalization  ability  of 
the  feedforward  neural  networks.  Successes  of  various  degrees  have  been  achieved. 
However,  more  efficient  and  globally  convergent  training  algorithms  are  needed  to 
deal  with  more  challenging  real  world  problems.  The  next  three  chapters  will  focus 
on  global  optimal  neural  network  training  algorithms. 


CHAPTER  5 
GLOBALLY  GUIDED  BACKPROPAGATION  (GGBP) 

In this chapter we propose a modification to the standard backpropagation algorithm.
The modification, while retaining the simplicity of the standard BP, introduces
two nice properties: (1) training time is reduced, and (2) convergence
to a global optimal solution is guaranteed. We start with a brief discussion of
the shortcomings of standard backpropagation. Then we develop the ideas behind
our approach and present the globally guided backpropagation algorithm (GGBP).
Experiments on two standard test problems are presented.

5.1     Limitations  of  BP 

The backpropagation (BP) method is one of the most widely used learning algorithms
for multi-layered feed-forward neural networks. The popularity of BP arises
from its simplicity and successful applications to many real world problems. It is
commonly recognized, however, that BP has some inherent shortcomings. Two of
the often cited BP shortcomings are (1) slow or no convergence, and (2) the possibility
of getting stuck in local minimum solutions (Tollenaere, 1990; Hirose et al.,
1991).

The objective of backpropagation learning is to find a set of network weights such
that the total error function defined by some measure is minimized. Unfortunately,
the error surface of a feed-forward neural network is generally very complicated due
to the convoluted nonlinear transfer functions. The error surface is generally characterized
by a large number of flat areas and troughs that have very small slope
(Hecht-Nielsen, 1989), and long ravines with sharp curvature (Battit and Masulli,
1990). Since the backpropagation algorithm is a simple steepest gradient descent
algorithm (Rumelhart et al., 1986), it does not take into consideration the curvature
of the error surface; hence it may become very inefficient, either by moving slowly in
the flat areas or by oscillating along the ravines. It is also clear that, with steepest
descent, once a solution gets stuck in a local minimum it has no way to escape.

Although many variations of BP have been developed, as discussed in the last
chapter, the effort to deal with the first problem, that is, to develop more efficient
neural net training algorithms, has met with only partial success. Few researchers have
considered the problem of local minimum solutions. Local refinements of the BP
algorithm, such as using second order information about the criterion function, improve
the learning speed, but still suffer from the same problem of staying stuck in a local
minimum once the solution is trapped.

5.2     The  Idea  of  Globally  Guided  Backpropagation 

The error surface of a feedforward neural network in the weight space is generally
very complicated. Figure 5.1 shows a typical error surface of the simple XOR network
(cf. Section 3.3) where large flat areas and narrow valleys exist. It is clear that a
strict gradient descent approach will encounter difficulties in such a weight space.
However, the error surface of a feedforward neural network in the output space is
quite simple. If we use a sum of squared error function, the error surface is convex
quadratic in the output space.

$$E = \frac{1}{2} \sum_{p} \sum_{k} (T_{pk} - O_{pk})^2 \tag{5.1}$$

Here $T_{pk}$ and $O_{pk}$ are the target and actual values of output unit k for pattern p.
Note that the error in Equation (5.1) is separable in p and k, which are the indices for
the pattern (example) and the output unit of the network, respectively. Minimization
of the quadratic function is easy if the output of the network can be controlled. The
unique local minimum of E is also a global minimum solution. The optimal outputs
are the target values.

Unfortunately, solving for weights W through the inverse function of output O is
extremely difficult, if not impossible. Because the neural network output is a sum
of convoluted (nested) activation functions (most commonly, sigmoid functions, cf.
Figure 3.2), weights W cannot be expressed analytically as a function of the training
data. Even if a set of equations (implicit functions) of the weights W, input X and
output O can be written down, there are no efficient procedures for solving the nonlinear
system of equations.

[Plot title: Error Surface of XOR (2 nodes) net; w2 = x1, w6 = x2]

Figure 5.1. Error surface of an XOR (2x2x1) network showing valley, plateau and
local minimum.

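However, if we change the output by a small amount, we will be able to find the
changes in weights W via a Taylor series expansion of O:

$$\begin{aligned}
\Delta O &= O(W + \Delta W, X) - O(W, X) \\
&= \nabla_W O(W, X)\,\Delta W + \frac{1}{2}\,\Delta W^T\,\nabla_W^2 O(W + \xi\,\Delta W, X)\,\Delta W
\end{aligned} \tag{5.2}$$

where $\xi \in (0, 1)$.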

If we update the weights of the network based on the changes in O, instead of
$-\eta \nabla_W E$ as in standard backpropagation, then we have reason to hope that this
weight updating scheme would (1) lead to faster convergence, since the search in the
weight space is guided directly by the search in the output space, and (2) lead to a
global optimal solution. The idea is to ignore the shape of the error surface in the
weight space by moving W in a direction such that in the output space error E is
always decreasing. This concept is depicted in Figure 5.2.

Figure 5.2. $\Delta W$ corresponding to $\Delta O$ would lead W to a global optimal solution.

5.3     Learning  Rule  Derivation 

The learning rule of GGBP is derived based on the changes in output space. Let
us consider a given training pattern. The error function is

$$E = \sum_{k=1}^{K} E_k = \frac{1}{2} \sum_{k=1}^{K} (T_k - O_k)^2 \tag{5.3}$$

where k is the index for the output units.

Changing output $O = (O_1, O_2, \ldots, O_K)^T$ based on gradient descent in the output
space gives

$$\Delta O(n) = O(n + 1) - O(n) = -\eta \nabla_O E(n) \tag{5.4}$$

where n is the iteration index. Using Equation (5.3),

$$\Delta O(n) = \eta (T - O(n)). \tag{5.5}$$

Assuming the change in W is small, we may use the first order approximation in Equation
(5.2). (For clarity we omit the variables W and X in the
function O(W, X).) Hence

$$\Delta O(n) = \nabla_W O(n)\,\Delta W. \tag{5.6}$$

Figure 5.3. A typical FNN where the weights associated with $O_i$ are independent of
other output units.

Note that here $\Delta O(n)$ is a K-dimensional vector, $\Delta W$ is an S-dimensional vector,
and $\nabla_W O$ is a K x S matrix. Finding $\Delta W$ requires the pseudoinverse of the matrix
$\nabla_W O$. This is computationally undesirable. Considering the special structure of the
feed-forward neural network, we notice that the weights of the output layer associated
with output unit i are independent of the output units $O_k$, k = 1, 2, ..., K, k ≠ i (see
Figure 5.3).

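We can rewrite $\Delta O$ as

$$\Delta O = \left[ \nabla_{W_H} O,\ \nabla_{W_{O_1}} O,\ \ldots,\ \nabla_{W_{O_K}} O \right]
\begin{bmatrix} \Delta W_H \\ \Delta W_{O_1} \\ \vdots \\ \Delta W_{O_K} \end{bmatrix} \tag{5.7}$$

where $W_H$ denotes the weights in the hidden layer(s) and $W_{O_i}$ the output layer weights
associated with output node i. Each component of $\Delta O$ becomes

$$\begin{aligned}
\Delta O_k &= \nabla_{W_H} O_k\,\Delta W_H + \nabla_{W_{O_k}} O_k\,\Delta W_{O_k} \\
&= \nabla_{W_k} O_k\,\Delta W_k, \qquad k = 1, 2, \ldots, K
\end{aligned} \tag{5.8}$$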


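where $W_k$ denotes all weights contributing to output $O_k$. $\Delta O_k$ becomes the inner
product of two vectors, $\nabla_{W_k} O_k$ and $\Delta W_k$. For a given $\|\Delta W_k\|$, this inner product
is maximized if we choose $\Delta W_k$ in the direction of $\nabla_{W_k} O_k$. Thus we have

$$\Delta O_k = \|\nabla_{W_k} O_k\|\,\|\Delta W_k\|, \qquad k = 1, 2, \ldots, K \tag{5.9}$$

or

$$\|\Delta W_k\| = \frac{\Delta O_k}{\|\nabla_{W_k} O_k\|}, \qquad k = 1, 2, \ldots, K. \tag{5.10}$$

The normalized component of $\Delta W_k$ is

$$\Delta w_s = \|\Delta W_k\|\,\frac{\partial O_k / \partial w_s}{\|\nabla_{W_k} O_k\|}. \tag{5.11}$$

Substituting $\|\Delta W_k\|$ with Equation (5.10) gives

$$\Delta w_s = \frac{\Delta O_k\,\dfrac{\partial O_k}{\partial w_s}}{\|\nabla_{W_k} O_k\|^2}. \tag{5.12}$$

Replacing $\Delta O_k$ using Equation (5.5) results in

$$\Delta w_s = \frac{\eta (T_k - O_k)\,\dfrac{\partial O_k}{\partial w_s}}{\|\nabla_{W_k} O_k\|^2}. \tag{5.13}$$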

If $w_s$ is a weight in the output layer, Equation (5.13) is used as the weight updating
rule. If $w_s$ is a weight in a hidden layer, we need to consider the effect of all the outputs
on it. The changes due to each output $O_k$, $k = 1, 2, \ldots, K$, are summed up. Hence we
have

$$\Delta w_s = \sum_{k=1}^{K} \frac{\eta (T_k - O_k) \, \partial O_k / \partial w_s}{\|\nabla_{W_k} O_k\|^2} \tag{5.14}$$

for all $s \in W_H$.

This heuristic approach (summing up the $\Delta w_s$'s) is also used in White (1990) where
similar  results  are  obtained  from  an  application  of  Newton's  method.  The  advantage 
of  this  approach  is  the  simplicity  of  the  weight  updating  rule.  The  down-side  of  this 
heuristic  is  that  the  weight  changes  in  hidden  layers  are  approximated.  Ideally,  the 
weight  changes  should  depend  on  the  current  training  pattern,  not  each  component 
of  that  pattern.  Towards  that  end,  a  more  rigorous  hidden  layer  weight  updating 
method  can  be  derived  as  follows. 
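
Summing up the components of $\Delta O$ (cf. Equation (5.8)), we have

$$\sum_k \Delta O_k = \sum_k \nabla_{W_H} O_k \, \Delta W_H + \sum_k \nabla_{W_{O_k}} O_k \, \Delta W_{O_k}. \tag{5.15}$$

Because of the special structure of the feedforward network, $\Delta W_H$ is the same for all
$k = 1, 2, \ldots, K$. Thus we can separate $\Delta W_H$ and obtain

$$\Big( \sum_k \nabla_{W_H} O_k \Big) \Delta W_H = \sum_k \Delta O_k - \sum_k \nabla_{W_{O_k}} O_k \, \Delta W_{O_k}. \tag{5.16}$$

Similar to the derivation of Equation (5.13), we have for all $s \in W_H$

$$\begin{aligned}
\Delta w_s &= \|\Delta W_H\| \, \frac{\sum_k \partial O_k / \partial w_s}{\|\sum_k \nabla_{W_H} O_k\|} \\
&= \frac{\big( \sum_k \Delta O_k - \sum_k \nabla_{W_{O_k}} O_k \, \Delta W_{O_k} \big) \sum_k \partial O_k / \partial w_s}{\|\sum_k \nabla_{W_H} O_k\|^2} \\
&= \frac{\big( \sum_k \eta (T_k - O_k) - \sum_k \sum_{t \in W_{O_k}} \frac{\partial O_k}{\partial w_t} \Delta w_t \big) \sum_k \partial O_k / \partial w_s}{\|\sum_k \nabla_{W_H} O_k\|^2}.
\end{aligned} \tag{5.17}$$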


Comparing  Equation  (5.17)  with  Equation  (5.13),  we  note  that  the  equation  is 
more  complicated  and  requires  explicitly  the  computation  of  weight  changes  in  the 
output  layer. 

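Recall that in standard backpropagation, the weights are updated with the following
formulas:

$$\Delta w_s = -\lambda \frac{\partial E}{\partial w_s} = \lambda (T_k - O_k) \frac{\partial O_k}{\partial w_s} \tag{5.18}$$

for the output layer and

$$\Delta w_s = -\lambda \frac{\partial E}{\partial w_s} = \lambda \sum_{k=1}^{K} (T_k - O_k) \frac{\partial O_k}{\partial w_s} \tag{5.19}$$

for the hidden layer(s).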

Note the similarity of the weight updating scheme of GGBP to that of
standard BP, especially when (5.14) is used. The new method is similar to
standard BP with a dynamically adjusted learning rate

$$\lambda = \eta F\!\left( \frac{\partial O}{\partial W} \right) \tag{5.20}$$

where $F$ is a function of the partials of the output with respect to the weights. The
concepts of the two approaches are, however, quite different. With GGBP, $\eta$ is a
fixed learning parameter in the output space, while $\lambda$ of standard BP is a fixed
learning rate in the weight space.

5.4     Convergence  of  GGBP 

Updating weights with Equations (5.13) and (5.14) (or (5.17)) will ensure that the
global error is decreasing, as long as the approximation used in the Taylor expansion is
valid. Following Equation (5.4), and changing notation slightly (by explicitly writing
$W$ and $X$), we have

$$\Delta O(W^n, X) = O(W^n, X) - O(W^{n-1}, X) = \eta (T - O(W^{n-1}, X)) \tag{5.21}$$

$$\begin{aligned}
O(W^n, X) &= \eta T + (1 - \eta) O(W^{n-1}, X) \\
&= \eta T + (1 - \eta) \eta T + \cdots + (1 - \eta)^{n-1} \eta T + (1 - \eta)^n O(W^0, X) \\
&= T [1 - (1 - \eta)^n] + (1 - \eta)^n O(W^0, X).
\end{aligned} \tag{5.22}$$

For any $\eta \in (0, 1)$, $O(W^n, X) \to T$ as $n \to \infty$. That is, the output converges
to the target value. Note that the convergence property is guaranteed by (5.21)
only for the case of a single example. For a multi-example training set the weight
updating rules of GGBP are still valid if the instance training method is used. But
the convergence proof remains an open issue, as in the case of standard BP. Empirical
results, though, have shown that convergence is typical when $\eta$ is small.
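
The single-example recursion can be checked numerically. The sketch below iterates the output-space rule (5.21) for a scalar output and compares it with the closed form (5.22); the starting output, target, and $\eta$ are arbitrary illustrative values, and the first-order approximation is assumed to hold exactly.

```python
def ggbp_output_sequence(o0, target, eta, steps):
    """Iterate O(n) = O(n-1) + eta * (T - O(n-1)), the output-space rule (5.21)."""
    outputs = [o0]
    for _ in range(steps):
        outputs.append(outputs[-1] + eta * (target - outputs[-1]))
    return outputs


def closed_form(o0, target, eta, n):
    """Closed form (5.22): T * [1 - (1 - eta)^n] + (1 - eta)^n * O(0)."""
    decay = (1.0 - eta) ** n
    return target * (1.0 - decay) + decay * o0


seq = ggbp_output_sequence(o0=0.2, target=1.0, eta=0.5, steps=20)
gap_to_closed_form = abs(seq[20] - closed_form(0.2, 1.0, 0.5, 20))
gap_to_target = abs(seq[20] - 1.0)
print(gap_to_closed_form, gap_to_target)
```

The remaining distance to the target shrinks by the factor $(1 - \eta)$ at every step, so the convergence is geometric.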

The  extension  of  the  GGBP  algorithm  to  sample  training  is  not  straight-forward 
because  the  output  becomes  a  matrix  when  all  patterns  are  considered.  Conceptually, 
the  GGBP  approach  is  still  applicable.  The  derivation  of  the  weight  updating  rules 
then  requires  iterative  solutions  to  a  system  of  linear  equations.  On  the  other  hand, 
a  heuristic  of  applying  GGBP  to  sample  training  is  to  simply  add  up  the  weight 
changes  resulting  from  all  training  patterns  and  update  the  weights  based  on  the 
sums. 

5.5     The GGBP Algorithm

The GGBP algorithm is similar to the standard backpropagation algorithm, and its
implementation is straightforward. Note that there is a slight difference in the
definition of $\delta$ in the two algorithms. Also, we do not use the momentum term in our
algorithm since GGBP is supposed to search in the output space, where there are no
ravines or plateaus. GGBP is formally stated below.

Algorithm  GGBP 
1.  INITIALIZE: 

•  Construct  the  feedforward  neural  network.  Choose  the  number  of  input 
units  and  the  number  of  output  units  equal  to  the  length  of  input  vector 
X  and  the  length  of  target  vector  T,  respectively. 

•  Randomize the weights and biases in the range (-0.5, 0.5).

•  Specify a stopping criterion such as $E < E_{stop}$ or $n > n_{max}$.
2.  FEEDFORWARD: 

•  Compute the output for the non-input units. The network output for a
given example $p$ is

$$O_{pk} = f\Big( \sum_j w_{jk} f\Big( \sum_m w_{mj} f\big( \cdots f\big( \sum_i w_{il} x_i \big) \big) \Big) \Big).$$

Note that $\theta_j$ is replaced by $w_{0j}$ for notational convenience.

•  Compute  the  error  using  Equation  3.7. 

•  If  a  stopping  criterion  is  met,  stop. 

3.  BACKPROPAGATE: 
For k = 1, 2, ..., K repeat

•  For each output unit k, compute

$$\delta_k = f'(net_k),$$
$$\beta_k \leftarrow \beta_k + \delta_k^2.$$

•  For each hidden unit j, compute

$$\delta_j = \delta_k w_{jk} f'(net_j),$$
$$\beta_k \leftarrow \beta_k + \delta_j^2.$$

End repeat.
4.  UPDATE: 

•  For output layer

$$\Delta w_{jk} = \eta (T_k - O_k) \delta_k O_j / \beta_k$$

•  For hidden layer

$$\Delta w_{ij} = \sum_k \eta (T_k - O_k) \delta_j O_i / \beta_k$$


5.  REPEAT: 
Go  to  Step  2. 
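
As an illustration only, the steps above can be sketched for a network with one hidden layer of sigmoid units and a single training pattern. The function name `ggbp_step`, the network size, and the choice to accumulate the exact squared gradient norm $\|\nabla_{W_k} O_k\|^2$ (activations included) as the normalizer are assumptions of this sketch, following Equations (5.13) and (5.14) rather than any particular implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ggbp_step(params, x, target, eta):
    """One GGBP update for a single pattern on a one-hidden-layer sigmoid net."""
    W1, b1, W2, b2 = params
    h = sigmoid(W1 @ x + b1)          # hidden activations
    o = sigmoid(W2 @ h + b2)          # network outputs O_k

    dW1, db1 = np.zeros_like(W1), np.zeros_like(b1)
    dW2, db2 = np.zeros_like(W2), np.zeros_like(b2)

    for k in range(len(o)):
        d_out = o[k] * (1.0 - o[k])                # f'(net_k) for the sigmoid
        g_W2 = d_out * h                           # dO_k / dW2[k, :]
        d_hid = d_out * W2[k] * h * (1.0 - h)      # dO_k / dnet_j for each hidden j
        g_W1 = np.outer(d_hid, x)                  # dO_k / dW1

        # Squared norm of the gradient of O_k over every weight feeding output k.
        beta = d_out**2 + g_W2 @ g_W2 + d_hid @ d_hid + np.sum(g_W1**2)
        coef = eta * (target[k] - o[k]) / beta

        dW2[k] = coef * g_W2          # output-layer rule, as in (5.13)
        db2[k] = coef * d_out
        dW1 += coef * g_W1            # hidden-layer heuristic, as in (5.14): sum over k
        db1 += coef * d_hid

    return (W1 + dW1, b1 + db1, W2 + dW2, b2 + db2), o


rng = np.random.default_rng(0)
params = (rng.uniform(-0.5, 0.5, (2, 2)), rng.uniform(-0.5, 0.5, 2),
          rng.uniform(-0.5, 0.5, (1, 2)), rng.uniform(-0.5, 0.5, 1))
x, target = np.array([1.0, 0.0]), np.array([0.9])

for _ in range(200):
    params, out = ggbp_step(params, x, target, eta=0.1)
print(out)   # network output just before the final update
```

With a single output unit the first-order output change equals $\eta (T - O)$, so the output should approach the target geometrically. With several outputs the summed hidden-layer changes make this only approximate.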

5.6     Experiments 

Two  test  problems  are  used  to  illustrate  and  evaluate  the  performance  of  GGBP. 
Both problems are standard test problems. All tests were run on an 80386
microcomputer. The reported results are averages of 20 runs; GGBP and standard BP
were started from the same random initial weights. All numbers are rounded to the
nearest integer.

5.6.1     The  XOR  Problem 

The Exclusive Or (XOR) problem has been used extensively as a benchmark for
neural network algorithm evaluation for historical reasons. The problem has been
described in Section 3.3. Solving the problem requires classifying the inputs into two
disjoint classes. Separating those two classes is hard because it requires a non-convex
separation. A 2 x 2 x 1 feedforward network is used. Table 5.1 shows the experimental
results of GGBP vs standard BP in terms of training epochs.

Table 5.1. Training Epochs of GGBP vs BP for the XOR

             BP lr=0.5 mo=0.9      GGBP lr=0.5          BP lr=0.5
E-stop       mean    std dev       mean    std dev      mean    std dev
0.04          206         89         62         10      2148        410
0.01          292         45         71         10         -          -
0.001        1369        310        110         52         -          -

Results in Table 5.1 show that GGBP is 3 to 13 times faster than standard BP
(BP used the parameters $\eta$ and $\alpha$ recommended by Rumelhart et al. (1986)). For
the sake of comparison, standard BP without the momentum term was also tested, which
resulted in a convergence speed about 35 times slower than that of GGBP. As the
stopping criterion becomes more stringent, the difference between GGBP and BP
becomes more significant. This is no surprise, as GGBP uses an approximation
scheme that is best in the neighborhood of the global minimum, while standard BP
slows down when the error signal becomes small. Typical learning curves of both
GGBP and BP are shown in Figure 5.4. Note that the GGBP solution oscillates in
the beginning. This shows that the linear approximation used in the algorithm is very
crude while the random initial weights dominate. The approximation becomes more
effective when the weights are brought closer to the global optimal point. We used
the heuristic method in the hidden layer weight updating, which may also contribute
to the inaccuracy during the initial learning period.

5.6.2     The  424  Encoding  Problem 

The encoding problem was proposed by Ackley, Hinton and Sejnowski (1985).
The problem is to map $N$-tuple input patterns to $N$-tuple output patterns through
a hidden layer with $\log_2 N$ units. Passing through the hidden layer requires data
compression. In other words, the problem requires encoding an $N$-bit pattern into a
$\log_2 N$-bit pattern. Passing from the hidden layer to the output layer requires decoding
the compressed pattern into the original one. The general neural network structure for
solving the encoding problem is $N \times \log_2 N \times N$.
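
A small sketch of how the training set for such a network might be built follows. It assumes the common formulation in which the $N$ patterns are one-hot vectors and each target equals its input; the function name `encoder_problem` is hypothetical.

```python
import math


def encoder_problem(n):
    """Return (patterns, hidden_units) for an N x log2(N) x N encoder task."""
    hidden = int(math.log2(n))
    if 2 ** hidden != n:
        raise ValueError("n must be a power of two")
    # Assumption: one-hot patterns; the target of each pattern equals its input.
    patterns = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    return patterns, hidden


patterns, hidden = encoder_problem(4)
print(hidden)        # 2
print(patterns[1])   # [0, 1, 0, 0]
```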

Figure 5.4. Learning curve of GGBP (solid line) vs BP (dotted line).


Table 5.2. Training Epochs of GGBP vs BP for the 424 Encoding

             BP lr=0.5 mo=0.9      GGBP lr=0.5
E-stop       mean    std dev       mean    std dev
0.04          935        647        177        155
0.01         4635       2545        187        182

We  tested  GGBP  on  a  4  x  2  x  4  network.  The  results  are  summarized  in  Table  5.2. 
The  speed-up  of  GGBP  over  the  standard  BP  is  a  factor  of  5  to  nearly  25.  Similar 
to  the  case  of  the  XOR  problem,  the  performance  of  GGBP  is  significantly  better 
than  the  standard  BP  when  the  solution  standard  is  set  higher.  While  the  number  of 
training  epochs  of  BP  increased  about  4  times  when  the  stopping  criterion  decreased 
from  0.04  to  0.01,  the  number  of  training  epochs  of  GGBP  changed  very  little. 

5.7     Comparison  of  GGBP  and  BP 

Several advantages of GGBP over standard BP are shown in the derivation of the
algorithm and evidenced in the test results. First of all, GGBP starts with a new
concept. The algorithm considers optimization of the global function in the output
space. This leads to faster learning and convergence to a global optimal solution.
The speed advantage can be attributed to the fact that the search is guided by the
changes in the output space. That is, the weight change in the weight space does not
necessarily follow the gradient descent direction. The problems that standard BP has
with flat plateaus and deep ravines in the weight space are avoided.

The  second  advantage  of  GGBP  is  that  it  does  not  use  the  momentum  term. 
Choosing  a  good  combination  of  learning  rate  and  momentum  with  standard  BP 
often  poses  a  challenge  to  the  inexperienced  neural  network  users.  In  this  sense, 
GGBP  is  easier  to  use  than  standard  BP.  We  noticed  in  our  experiments  that  a 
learning  rate  less  than  0.5  usually  produces  fast  and  stable  solutions. 

Although GGBP has a constant learning rate in this implementation, this need
not be the case. A dynamically adjusted learning rate might improve its performance.
Even with a fixed learning rate (in the output space), GGBP is analogous to standard
BP with a dynamic learning rate in the weight space. The adjustment of the learning
rate in the weight space is well founded in GGBP by the derivation of
the algorithm. BP with a dynamically adjusted learning rate has been studied by
several researchers (Vogl et al., 1988; Jacobs, 1988; Silva and Almeida, 1990). Those
approaches are heuristics. They work in some limited domains and may produce
controversial results. Viewed as BP with a dynamic learning rate, GGBP provides
a learning rate adjusting mechanism that avoids detailed consideration of the
shape of the error surface in the weight space.

The  speed-up  of  GGBP  over  BP  is  evidenced  by  experiments.  A  remarkable 
feature  of  GGBP  is  that  it  still  has  a  fast  learning  speed  even  when  the  error  becomes 
small  while  BP  becomes  hopelessly  slow.  This  feature  could  be  especially  beneficial 
to  problem  domains  where  accurate  learning  is  required. 

Besides the speed and parameter advantages, GGBP makes its search guided by
the changes in the output space, where a local minimum is also the global minimum.
Thus, theoretically, GGBP should always find a global optimal solution with a small
enough step size. Care must be exercised, however, in drawing those conclusions.
All the nice properties of GGBP are built on the assumption that the small weight
changes using the updating rule will produce the desired output change, which leads
to a decrease of the global error. Careful examination of this assumption reveals that
it is only approximately true. Part of the inaccuracy results from the first-order
approximation via Taylor's expansion of the output function $O(W, X)$. Another factor
that may adversely affect the approximation is that the hidden weights of the neural
network are dependent on all the output units. The asynchronous presentation of
target values (for a given pattern) renders the computation of the hidden layer weight
change inaccurate. Nevertheless, the GGBP algorithm is shown to perform significantly
better than standard BP. The performance of GGBP could be improved by
considering higher order approximations and synchronized parallel implementation.
It is not yet clear how those improvements can be carried out, but the concept of
computing a weight change to produce a desired output change is appealing. Research
along this line could be promising.


CHAPTER  6 
STOCHASTIC  GLOBAL  ALGORITHMS 

The globally guided backpropagation (GGBP) algorithm introduced in Chapter 5
guarantees a global optimal solution as long as the learning rate is small
enough. However, the requirement of a small learning rate may cause slow convergence.
The interest in finding a global optimal solution and efficient learning
algorithms has prompted neural network researchers to look into the global optimization
literature. Some researchers have explored the use of genetic algorithms and
simulated annealing in neural network training. In this chapter, we will discuss the search
mechanisms of stochastic global algorithms and their implementation in feedforward
neural networks: genetic algorithms, simulated annealing, random search methods,
and clustering methods.

6.1     Genetic  Algorithm 

The  concept  of  genetic  algorithm  (GA)  was  introduced  by  Holland  (1975).  Genetic 
algorithms  are  a  class  of  search  algorithms  based  on  several  features  of  biological 
evolution,  such  as  cross-over  (mating)  and  random  perturbation  (mutation).  In  recent 
years,  genetic  algorithms  have  been  successfully  applied  to  a  large  variety  of  problems 
in  optimization,  learning,  and  operations  management  (Goldberg,  1989).  Generally, 
a  genetic  algorithm  has  the  following  components: 

1.  An  encoding/decoding  scheme  that  maps  the  solution  of  the  problem  to  a  bit 
stream  (chromosome). 

2.  An  initial  population  consisting  of  initial  possible  solutions. 

3.  A  set  of  operators  that  are  applied  to  the  population.     As  a  result,  a  new 
population  is  generated. 

4.  A  criterion  function  that  measures  the  fitness  of  a  solution. 

A genetic algorithm starts with an initial population. The members of the population
are evaluated with the criterion function. Part of the population is chosen to create
the next generation through cross-over, mutation, and/or other domain-specific
operators. Selection of the parent members is determined by a probability
distribution over their fitness as measured by the criterion function (Holland, 1975).

The cross-over operator is applied to two parents. A random bit of the bit stream
is chosen, and at that point the parents' bits are crossed over. That is, the parents
exchange part of their bit streams starting from the chosen bit. The mutation operator
is applied to a single parent or child. A random bit of the parent is chosen and is
changed to its complement.
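
The two operators can be sketched on bit strings as follows. This is a minimal illustration with chromosomes represented as Python strings of "0" and "1"; the function names and the fixed random seed are assumptions made for the example.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point cross-over: swap the tails of two equal-length bit strings."""
    point = rng.randrange(1, len(parent_a))      # cut somewhere inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, rng):
    """Flip one randomly chosen bit to its complement."""
    i = rng.randrange(len(chromosome))
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


rng = random.Random(1)
a, b = "00000000", "11111111"
c1, c2 = crossover(a, b, rng)   # c1 is a run of 0s followed by a run of 1s
m = mutate(a, rng)              # exactly one bit of a is set to 1
print(c1, c2, m)
```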

For the application of genetic algorithms to feedforward neural networks, a simple
implementation is to encode all weights and biases as a single vector (Montana and
Davis, 1989). For example, for the XOR network with a single hidden node (cf.
Figure 3.8), a solution is represented by a vector $w = (w_1, w_2, w_3, w_4, w_5, w_6, w_7)$. An
initial population $P_0 = \{w^1, \ldots, w^n\}$ can be generated with each $w_i$, $i = 1, 2, \ldots, 7$, being
taken from a random distribution, say, a uniform or Gaussian distribution. The cross-over
operator is applied as discussed before. The mutation operator can be modified
such that a random perturbation is added to a randomly chosen component of the
parent.
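
A sketch of this real-valued encoding is given below. The population size, the uniform range (-0.5, 0.5), and the Gaussian perturbation scale are illustrative assumptions; only the vector length of 7 comes from the XOR example.

```python
import numpy as np


def init_population(size, n_weights, rng):
    """Draw each weight vector from a uniform distribution on (-0.5, 0.5)."""
    return rng.uniform(-0.5, 0.5, size=(size, n_weights))


def crossover(w_a, w_b, rng):
    """Exchange the tails of two weight vectors at a random cut point."""
    point = rng.integers(1, len(w_a))
    return (np.concatenate([w_a[:point], w_b[point:]]),
            np.concatenate([w_b[:point], w_a[point:]]))


def mutate(w, rng, scale=0.1):
    """Add a Gaussian perturbation to one randomly chosen component."""
    child = w.copy()
    child[rng.integers(len(w))] += rng.normal(0.0, scale)
    return child


rng = np.random.default_rng(0)
population = init_population(size=10, n_weights=7, rng=rng)
child_a, child_b = crossover(population[0], population[1], rng)
mutant = mutate(child_a, rng)
print(population.shape, mutant)
```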

Montana and Davis (1989) reported that their genetic algorithm outperformed
the classic backpropagation algorithm (without momentum). A more involved coding
scheme was used by Chalmers (1990), where the weight-space dynamics was coded as
linear genomes consisting of bit streams. Belew et al. (1990) considered using the
genetic algorithm to generate a good initial weight set $w_0$ that is then used in place
of the random initial weights of the backpropagation algorithm. As can be expected,
the performance of BP was improved with $w_0$ chosen by GA. The results of Offutt
(1989) showed that GA could train a feedforward neural network more quickly than
BP, provided the network is fairly deep.¹

¹ As discussed in Chapter 4, classic BP is inefficient, even more so when applied to deep nets
(i.e., nets with many hidden layers).

The search mechanism of the genetic algorithm can be implemented within the BP
algorithm to help increase learning speed and avoid local minima. The idea is
that when the BP algorithm is detected to be in a flat region where the gradient in
the weight space is nearly zero, a large jump produced by sufficient mutation of the
current solution should be more efficient in bringing the solution out of the stagnant
state than a gradient descent move. If the solution is stuck at a local minimum,
the gradient descent approach simply fails to proceed, while genetic mutation (or
possibly cross-over of different solutions) may make a solution tunnel through the
surrounding peaks of the local minimum and lead to the attraction region of some
more promising (local) minimum.

When to apply GA can be determined by the following heuristic. A gradient
threshold $\theta$ is defined; $\theta$ can be preset or dynamically derived. A weight $w_i$ is
labeled inert whenever $|\partial E / \partial w_i| < \theta$. Between regular BP sessions, those weights
labeled inert are perturbed by a random amount (mutation). If $|\partial E / \partial w_i| < \theta$ for all
$w_i$, then the current solution must be in a flat area of the weight space, and a cross-over
between the current solution and a different solution can be performed. The genetic
algorithm augmented backpropagation (GAABP) algorithm is stated below.

Algorithm  GAABP 

1.  INITIALIZE: 

•  Construct  the  feedforward  neural  network.  Choose  the  number  of  input 
units  and  the  number  of  output  units  equal  to  the  length  of  input  vector 
x  and  the  length  of  target  vector  t ,  respectively. 

•  Randomize the weights $w(0)$ (including biases) in the range (-0.5, 0.5).

•  Specify a stopping criterion such as $F < F_{stop}$ or $n > n_{max}$. Set the iteration
number $n = 0$.

•  Set $\bar{w} = w(0)$, where $\bar{w}$ is the candidate weight set for cross-over with the current
weight set $w(n)$.

2.  FEEDFORWARD: 

•  Compute the output for the non-input units. The network output for a
given example $p$ is

$$o_{pk} = f\Big( \sum_j w_{jk} f\Big( \sum_m w_{mj} f\big( \cdots f\big( \sum_i w_{il} x_i \big) \big) \Big) \Big).$$

•  Compute  the  error  using  Equation  3.7. 

•  If  a  stopping  criterion  is  met,  stop. 

3.  BACKPROPAGATE: 

•  $n \leftarrow n + 1$.

•  For each output unit $k$, compute

$$\delta_k = (o_k - y_k) f'(net_k).$$

•  For each hidden unit $j$, compute

$$\delta_j = f'(net_j) \sum_k \delta_k w_{jk}.$$

•  If $|\delta_j o_i| < \theta$, then label($w_{ij}$) = inert.

4.  UPDATE: 

•  Mutation: If label($w_{ij}$) = inert, then

$$\Delta w_{ij}(n+1) = \mathrm{Random}(F - F_{stop})$$

where $F$ is the current criterion value and $F_{stop}$ the desired value. Random()
is a function returning a random value of $\Delta w_{ij}$ with a given probability
distribution.

•  Cross-over: If label($w_{ij}$) = inert for all $w_{ij} \in w(n)$, then let $w^1, w^2$ be the
results of Cross-over($w(n)$, $\bar{w}$), where $\bar{w}$ is the candidate weight set, and
$w^1 = \arg\min\{F(w^1), F(w^2)\}$. Update the weights with

$$w_{ij}(n+1) = w^1_{ij},$$
$$\bar{w}_{ij} = w^2_{ij}.$$

•  Gradient descent:

$$\Delta w_{ij}(n+1) = \eta \delta_j o_i + \alpha \Delta w_{ij}(n)$$

where $\eta > 0$ is the learning rate (step size) and $\alpha \in [0, 1)$ is the momentum.

5.  REPEAT: 
Go  to  Step  2. 

Generally, mutation produces local variations, while cross-over enables larger
changes that may help a stagnant solution move out of a local minimum. Random
mutation may follow a uniform distribution or a Gaussian distribution. The
cross-over operator returns two new weight sets. The one with the better objective value
is taken as the updated solution, and the other is used as the candidate for the next
cross-over operation.
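
The inert-weight test and the mutation step can be sketched as below. The function name `gaabp_mutation` is hypothetical, and the choice of zero-mean Gaussian noise whose spread equals $F - F_{stop}$ is an assumption; the algorithm only states that Random() draws from a given distribution.

```python
import numpy as np


def gaabp_mutation(weights, grad, theta, f_now, f_stop, rng):
    """Perturb only the weights whose gradient magnitude is below theta."""
    inert = np.abs(grad) < theta
    # Assumption: zero-mean Gaussian noise whose spread scales with F - F_stop.
    scale = max(f_now - f_stop, 0.0)
    noise = rng.normal(0.0, scale, size=weights.shape)
    return np.where(inert, weights + noise, weights), inert


rng = np.random.default_rng(0)
w = np.array([0.2, -0.1, 0.4])
grad = np.array([1e-4, 0.3, 5e-3])
new_w, inert = gaabp_mutation(w, grad, theta=1e-2, f_now=0.2, f_stop=0.01, rng=rng)
print(inert, new_w)
```

Weights that are not inert are returned unchanged and would receive the ordinary gradient descent update instead.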

6.2     Simulated  Annealing 

Simulated annealing (SA) is a general heuristic optimization algorithm. The algorithm
is based on concepts from statistical physics. In the early 1980s, Kirkpatrick et al.
(1983) noticed that there is a strong similarity between combinatorial optimization
and the annealing of solid materials such as metals. In a physical thermodynamic
system at thermal equilibrium, the system state is characterized by a probability
distribution known as the Boltzmann distribution, as shown in Figure 6.1.

The horizontal axis is the system energy and the vertical axis is the probability of the
system being in a state with energy $E$. From the distribution we notice that: (1) a system
state with lower energy has higher probability, and (2) as the temperature $T$ decreases,
the system becomes stable at a low energy state, because the probability of the system
being in a high energy state approaches zero as the temperature decreases. The
annealing process reduces the system temperature slowly so that the thermal
system will reach a global minimum energy state following a sequence of equilibrium
states. Note that if the system fails to reach equilibrium at each temperature, e.g.,
when the temperature is reduced very fast, then the system can never reach the global
minimum energy state, as shown in Figure 6.2.

Figure 6.1. Boltzmann distribution at different temperatures

Figure 6.2. Equilibrium and non-equilibrium energy state

In an optimization system, we define the objective function value as the system
energy and each feasible solution as a system state with the corresponding system energy.
A control parameter $T$ (an analog of the temperature in a physical annealing process)
is introduced such that at each value of $T$, we give a probability distribution of the
system energy. Thus, at each value of $T$, the system energy assumes a range of values,
each with a certain probability. As in the annealing process of a physical system, if $T$
is reduced slowly, the optimization system will eventually reach the global optimum
solution with minimum system energy.

Generally, SA starts with a feasible initial solution. New solutions are generated
from the current solution. The key feature of SA is the acceptance rule for new
solutions, known as the Metropolis rule (Metropolis et al., 1953). A new solution with a
better objective value is always accepted. This is similar to the gradient descent
approach. However, a new solution that is worse than the old solution may also be
accepted, depending on a given probability distribution. The probability of accepting
deteriorated solutions decreases as the algorithm proceeds. This is controlled by
the parameter $T$. The mechanism of allowing occasional hill climbing/tunneling,
instead of strict descent, enables the SA algorithm to avoid local minima, provided
a proper annealing schedule is followed.

Simulated annealing has been successfully applied to many combinatorial optimization
problems (Laarhoven and Aarts, 1987). Eglese (1990) presented a good
review of the algorithm and its applications in operations research. A few researchers
have applied the concepts of SA to neural network training. Bernasconi (1990) used
the basic SA technique in neural nets. Random weight vectors were generated, and
the probability of acceptance of a new weight vector is given by

$$P_{accept} = \begin{cases} 1 & \text{if } \Delta F \le 0 \\ \exp(-\Delta F / T) & \text{if } \Delta F > 0. \end{cases} \tag{6.1}$$
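
The acceptance rule of Equation (6.1) translates directly into code. The sketch below uses Python's standard `random` module; the function name `accept` and the sampled values are illustrative assumptions.

```python
import math
import random


def accept(delta_f, temperature, rng):
    """Metropolis rule (6.1): always accept improvements, sometimes accept worse."""
    if delta_f <= 0:
        return True
    return rng.random() < math.exp(-delta_f / temperature)


rng = random.Random(0)
trials = 10000
# Empirical acceptance rate of an uphill move with Delta F = 1 at T = 1.
rate = sum(accept(1.0, 1.0, rng) for _ in range(trials)) / trials
print(rate)   # should be close to exp(-1)
```

As $T$ shrinks, `math.exp(-delta_f / temperature)` tends to zero for any uphill move, and the rule degenerates to strict descent.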

Fang and Li (1991) reported a similar approach for training feedforward neural
networks. They used three different probability distribution functions, namely, uniform,
Gaussian, and Cauchy distributions, to generate random perturbations of the
current weight vector. For the uniform and Gaussian distributions, the annealing
schedule

$$T = T^0 / (1 + C \log(n + 1)) \tag{6.2}$$

is used to ensure asymptotic convergence to a global optimal solution, where $T^0$ is the
initial temperature, $C$ is a constant, and $n$ is the number of simulation steps. Note
that this annealing schedule requires a training time that is an exponential function
of the ratio of the starting temperature $T^0$ to the ending temperature $T^N$.

Szu (1986) advocates the use of a Cauchy distribution. In this case, SA
requires an annealing schedule whose training time is linear in the ratio $T^0 / T^N$. Here

$$T = T^0 / (1 + n / C). \tag{6.3}$$

SA with a Cauchy distribution is therefore called fast simulated annealing (FSA) by Szu
(1986).
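
The difference between the two schedules can be made concrete by solving each for the number of steps needed to cool by a given ratio $T^0 / T^N$. The sketch assumes the logarithm in Equation (6.2) is the natural logarithm; the function names and the values of $C$ and the ratio are illustrative.

```python
import math


def boltzmann_schedule(t0, c, n):
    """Logarithmic schedule (6.2): T = T0 / (1 + C log(n + 1))."""
    return t0 / (1.0 + c * math.log(n + 1))


def cauchy_schedule(t0, c, n):
    """Fast-annealing schedule (6.3): T = T0 / (1 + n / C)."""
    return t0 / (1.0 + n / c)


def steps_needed(ratio, c):
    """Steps to cool by a factor ratio = T0 / TN under each schedule."""
    log_steps = math.exp((ratio - 1.0) / c) - 1.0   # solve (6.2) for n
    cauchy_steps = c * (ratio - 1.0)                # solve (6.3) for n
    return log_steps, cauchy_steps


log_steps, cauchy_steps = steps_needed(ratio=10.0, c=1.0)
print(log_steps, cauchy_steps)
```

For a tenfold cooling with $C = 1$, the logarithmic schedule needs on the order of $e^9$ steps while the Cauchy schedule needs 9, which is the exponential versus linear dependence noted in the text.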

The concepts of SA can be implemented in the classic backpropagation algorithm.
Our experiments on the learning curve of BP (see Figure 5.2) have shown that BP
training (with instance updating) consists of three phases. In the initial iterations,
the learning process exhibits chaotic behavior. This is followed by an adaptation
period in which the error decreases very slowly. The third phase, if it occurs, usually
shows a fast decrease of the total error and leads to convergence. SA can be
used to generate good initial solutions for the BP algorithm. This would reduce the
adaptation period of BP and lead the process directly to fast convergence.

The SA method can also be interwoven with BP to avoid local minima and
improve global convergence. Similar to GAABP, a heuristic for detecting BP
stagnancy is needed. Once the BP solution is stuck in a local minimum or a flat area
in the weight space, an SA procedure is invoked. The annealing schedule is controlled
by the difference between the criterion value of the current solution and the desired value.
So the probability of hill-climbing/tunneling is large when the solution is far away
from a global minimum,² and the probability of accepting worse solutions is reduced
as the quality of the current solution improves. The existence of a known lower
bound for the criterion function makes choosing a desired value feasible. An
SA augmented backpropagation (SAABP) algorithm is presented below.

² This is stated in a loose sense, meaning that the current solution is far from satisfactory, not
that the distance between the current solution and a global optimal solution is large.

Algorithm  SAABP 

1.  INITIALIZE: 

•  Construct  the  feedforward  neural  network.  Choose  the  number  of  input 
units  and  the  number  of  output  units  equal  to  the  length  of  the  input 
vector  x  and  the  length  of  the  target  vector  t,  respectively. 

•  Randomize  the  weights  w(0)  (including  bias)  in  the  range  (-.5,  .5). 

•  Specify  a  stopping  criterion  such  as  F  <  Fstop  or  n  >  nmax.  Set  iteration 
number  n  =  0. 

2.  FEEDFORWARD: 

•  Compute the output for the non-input units. The network output for a
given example $p$ is

$$o_{pk} = f\Big( \sum_j w_{jk} f\Big( \sum_m w_{mj} f\big( \cdots f\big( \sum_i w_{il} x_i \big) \big) \Big) \Big).$$


•  Compute  the  error  using  Equation  3.7. 

•  If  a  stopping  criterion  is  met,  stop. 

3.  BACKPROPAGATE: 

•  $n \leftarrow n + 1$.

•  For each output unit $k$, compute

$$\delta_k = (o_k - y_k) f'(net_k).$$

•  For each hidden unit $j$, compute

$$\delta_j = f'(net_j) \sum_k \delta_k w_{jk}.$$

•  If $|\delta_j o_i| < \theta$, then label($w_{ij}$) = inert.

4.  UPDATE: If label($w_{ij}$) $\neq$ inert, then

$$\Delta w_{ij}(n+1) = \eta \delta_j o_i + \alpha \Delta w_{ij}(n)$$

where $\eta > 0$ is the learning rate (step size) and $\alpha \in [0, 1)$ is the momentum.

5.  ANNEALING: 

•  If label($w_{ij}$) = inert for all $w_{ij} \in w(n)$, then generate $\Delta w$ from Random(),
which returns a random perturbation of $w$ based on a prescribed distribution.

•  If $\Delta F = F(w(n) + \Delta w) - F(w(n)) < 0$, then $w(n+1) = w(n) + \Delta w$.
Otherwise, if $\exp(-C \Delta F / \Delta F_t) > \mathrm{uniform}(0, 1)$, then $w(n+1) = w(n) +
\Delta w$, where $C$ is a constant and $\Delta F_t$ is the difference between the current objective
value and the desired value.

6.  REPEAT: 
Go  to  Step  2. 

Instead of the parameter (temperature) $T$, we use $\Delta F_t$ as the control factor in the
Boltzmann distribution. Using $\Delta F_t$ has the advantage of autonomous control. That
is, the probability of a non-descent move is determined by the quality of the current
solution. The algorithm will stop when the difference between the current error level
and the allowed tolerance is small enough.

The gradient threshold factor $\theta$ provides a control mechanism to adjust the relative
effect on neural net training by backpropagation and simulated annealing. When
$\theta \to 0$, the algorithm is reduced to the standard backpropagation. When $\theta \to \infty$,
we have a pure simulated annealing algorithm.
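
The acceptance test of the ANNEALING step can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the function name `accept_move`, its argument names, and the choice to reject non-descent moves once the desired value has been reached are assumptions made for the example.

```python
import math
import random


def accept_move(delta_f, f_current, f_desired, c=1.0, rng=random):
    """Decide whether to accept a perturbation in the ANNEALING step.

    delta_f   : F(w + dw) - F(w), the change in the criterion function
    f_current : F(w), the current objective value
    f_desired : the desired objective value (hypothetical parameter name)
    c         : the constant C in exp(-C * dF / dF_t)
    """
    if delta_f < 0:
        # Descent move: always accepted.
        return True
    # dF_t: gap between the current objective value and the desired value.
    delta_f_t = f_current - f_desired
    if delta_f_t <= 0:
        # Assumption: once the desired value is reached, reject worse moves.
        return False
    # Non-descent move: accepted with probability exp(-C * dF / dF_t).
    return math.exp(-c * delta_f / delta_f_t) > rng.random()
```

With a fixed uphill step, the acceptance probability is close to 1 when the current error is far above the desired value and shrinks as the gap closes, which is the autonomous control described above.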

6.3     Random  Search 

Random search methods are widely used in solving global optimization problems.
Salient features of random search methods include simplicity of implementation,
robustness in solving problems with different objective functions, and wide applicability.
However, random search methods are generally not efficient. Convergence of
these methods is often established in a probabilistic sense based on large samples.
Currently, there is no theoretical guideline as to when random search is desirable in an
optimization problem (Torn and Zilinskas, 1989, p.94). The rule of thumb is to use
random search methods when no efficient deterministic approaches are available.

Classic random search (Matyas, 1965) was used to find global optimal solutions
in a compact set $W$. The method involves the following three steps.

1.  Generate an initial feasible random vector $w^0 \in W$. Set the iteration count to
$n = 0$.

2.  Generate a random vector $\xi^n$ from a $N(0, \sigma)$ distribution, where $\sigma$ is the variance
of the Gaussian distribution, and is empirically determined.

If $w^n + \xi^n \in W$ and $F(w^n + \xi^n) < F(w^n)$, then $w^{n+1} = w^n + \xi^n$.
Otherwise $w^{n+1} = w^n$.

3.  $n = n + 1$.

If $F(w^n) < F_{stop}$, stop. Otherwise, go to step 2.
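
The three steps above can be sketched in a few lines. This is an illustrative sketch only: the function name, the `in_W` feasibility callback, the iteration cap `n_max`, and the use of a standard deviation `std` (Python's `random.gauss` takes a standard deviation rather than a variance) are assumptions of the example.

```python
import random


def random_search(F, w0, std, in_W, f_stop, n_max=10000, seed=0):
    """Classic random search with zero-mean Gaussian perturbations.

    F      : criterion function taking a list of floats
    w0     : initial feasible point
    std    : standard deviation of each perturbation component (assumed form)
    in_W   : callable returning True if a point lies in the feasible set W
    f_stop : stop as soon as F(w) < f_stop
    """
    rng = random.Random(seed)
    w, fw = list(w0), F(w0)
    for _ in range(n_max):
        if fw < f_stop:
            break
        # Step 2: draw a perturbation and test the candidate point.
        xi = [rng.gauss(0.0, std) for _ in w]
        cand = [wi + x for wi, x in zip(w, xi)]
        if in_W(cand):
            fc = F(cand)
            if fc < fw:
                # Accept only strict improvements.
                w, fw = cand, fc
    return w, fw


def sphere(w):
    """Toy criterion function used only for demonstration."""
    return sum(x * x for x in w)
```

For instance, `random_search(sphere, [1.0, 1.0], 0.3, lambda w: all(-2 <= x <= 2 for x in w), 1e-3)` never returns a point worse than the start, since only improving moves are accepted.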

Baba (1989) used a modified version of the above algorithm for optimal training of
multilayered feedforward neural nets. In the modified random optimization method
(MROM), a Gaussian random vector $\xi^n$ with nonzero mean is used. The mean, $b$,
of the Gaussian distribution is dynamically varied according to the effect of weight
changes on the criterion function. The algorithm is given below.

Algorithm MROM (Baba, 1989)

1.  Select an initial random vector $w^0 \in W$. Set $n = 0$, $b^0 = 0$, and $\sigma$ to some
prescribed value.

2.  $n = n + 1$. Generate a perturbation vector $\xi^n$ from $N(b^n, \sigma)$.

(1)  If $w^n + \xi^n \in W$ and $F(w^n + \xi^n) < F(w^n)$,
then $w^{n+1} = w^n + \xi^n$ and $b^{n+1} = 0.4\,\xi^n + 0.2\,b^n$.

(2)  If $w^n - \xi^n \in W$ and $F(w^n - \xi^n) < F(w^n)$,
then $w^{n+1} = w^n - \xi^n$ and $b^{n+1} = b^n - 0.4\,\xi^n$.

(3)  Otherwise, $w^{n+1} = w^n$ and $b^{n+1} = 0.5\,b^n$.

3.  If $F(w^n) < F_{stop}$, stop. Otherwise, go to step 2.
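
A compact sketch of the MROM update rules follows. It is an illustration under assumptions, not Baba's code: the function and parameter names are hypothetical, the perturbation is drawn componentwise with mean $b$ and standard deviation `std`, and an iteration cap is added so the loop always terminates.

```python
import random


def mrom(F, w0, std, in_W, f_stop, n_max=10000, seed=0):
    """Modified random optimization: Gaussian steps with an adaptive mean b."""
    rng = random.Random(seed)
    w, fw = list(w0), F(w0)
    b = [0.0] * len(w)  # mean of the perturbation distribution, b^0 = 0
    for _ in range(n_max):
        if fw < f_stop:
            break
        xi = [rng.gauss(bi, std) for bi in b]
        plus = [wi + x for wi, x in zip(w, xi)]
        minus = [wi - x for wi, x in zip(w, xi)]
        f_plus = F(plus) if in_W(plus) else float("inf")
        if f_plus < fw:
            # Case (1): forward step succeeds, bias b toward xi.
            w, fw = plus, f_plus
            b = [0.4 * x + 0.2 * bi for x, bi in zip(xi, b)]
            continue
        f_minus = F(minus) if in_W(minus) else float("inf")
        if f_minus < fw:
            # Case (2): reverse step succeeds, bias b away from xi.
            w, fw = minus, f_minus
            b = [bi - 0.4 * x for x, bi in zip(xi, b)]
        else:
            # Case (3): both directions fail, shrink the bias.
            b = [0.5 * bi for bi in b]
    return w, fw


def sphere(w):
    """Toy criterion function used only for demonstration."""
    return sum(x * x for x in w)
```

The only difference from the classic random search is the bookkeeping for $b$, which lets successful directions be reused in later draws.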

Sun  et  al.  (1990)  further  improved  the  random  search  method  by  introducing 
a  dynamic  variance  for  the  random  vector  generating  distribution.  The  heuristic 
random  optimization  method  (HROM)  was  reported  to  increase  learning  speed  by  a 
factor  of  4  as  compared  with  Baba's  algorithm,  for  the  6-bit  parity  problem. 

Random  search  methods  can  be  improved  with  BP-like  local  search  procedures. 
Singlestart  and  multistart  methods  combine  the  random  search  with  local  refinement. 
In  a  singlestart  method  a  single  local  search  is  carried  out  from  the  best  solution 
obtained  among  a  set  of  random  points.  Multiple  local  searches  are  performed  in 
multistart  methods.  Multistart  methods  are  useful  when  the  weight  space  exhibits 
complicated  shapes.  This  can  be  identified  when  the  solutions  from  the  singlestart 
approach  have  a  large  variance. 

Multistart can be incorporated into the BP training algorithm to reduce the probability
of the solution getting stuck in a local minimum. Suppose that, for a given problem,
the classic BP has a 20% chance of getting stuck in a local minimum. If the runs are
independent, a multistart version of BP with 5 initial points gets stuck in every run
with probability $0.2^5 = 0.032\%$, so the probability of obtaining only a local minimum
solution drops below 0.1%. Implementation of a multistart-BP
algorithm is a direct extension of the standard BP algorithm presented in Chapter 3.
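
The arithmetic and the wrapper structure of a multistart method can be sketched as below. The names `prob_all_stuck` and `multistart`, the `local_search` callback (standing in for BP), and the `sample_start` callback are hypothetical; the probability calculation assumes the starts are independent.

```python
import random


def prob_all_stuck(p_stuck, n_starts):
    """Probability that every one of n independent starts ends in a local minimum."""
    return p_stuck ** n_starts


def multistart(local_search, F, sample_start, n_starts, seed=0):
    """Run a local search from several random starts and keep the best result.

    local_search : maps a starting point to a locally refined point (e.g. BP)
    sample_start : draws a random starting point, given a random.Random instance
    """
    rng = random.Random(seed)
    best_w, best_f = None, float("inf")
    for _ in range(n_starts):
        w = local_search(sample_start(rng))
        fw = F(w)
        if fw < best_f:
            best_w, best_f = w, fw
    return best_w, best_f
```

With a 20% failure rate per run, `prob_all_stuck(0.2, 5)` evaluates to 0.00032, that is 0.032%.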

6.4     Clustering  Methods 

Clustering is a statistical technique that is used to identify the similarity relationship
among given data. In the context of global optimization, clustering methods are a
class of global optimization methods that employ a cluster analysis of the sample
points and group the sample points around local minima. Through cluster analysis
the regions of attraction of local minima are detected. Local search is then performed
in these identified regions. Becker and Lago (1970) were the first to apply a clustering
method in global optimization. Their algorithm is summarized below.

1.  Sampling:  Obtain  N  sample  points  of  W  through  random  sampling.  Retain  M 
of  the  N  sample  points  with  lowest  criterion  function  value. 


2.  Clustering:  Cluster  the  M  points  by  a  mode-seek  algorithm. 
Sort the clusters by the lowest function value in each cluster.

3.  Recursion:  Construct  subregions  Wl ,  W2, ...,  Wk  based  on  the  sorted  clusters. 
Each  subregion  contains  all  retained  points  of  one  cluster.  Repeat  step  1  and 
step  2  on  each  of  the  subregions,  starting  with  W1  (which  contains  the  lowest 
function  value  so  far). 

Note that the number of subregions used in step 3 may be limited to only the
best few. A more general approach would retain all the subregions and perform a
depth-first search until a satisfactory solution is found. Torn (1973) incorporated
local search into the clustering method. Instead of successively constructing new
subregions, a local search method was applied to each of the subregions in step 2.

Many  variations  of  the  clustering  methods  have  been  developed  (Price,  1978; 
Boender  et  al.,  1982;  Timmer,  1984).  Clustering  methods  have  been  considered 
among  the  most  efficient  methods  for  global  optimization. 

The  ideas  of  clustering  methods  may  be  ported  to  neural  net  training.  We  can 
let  the  BP  algorithm  serve  as  a  subroutine.  Then  BP  may  be  called  to  start  at  the 
best  point  found  by  the  clustering  method.  Recall  that  the  initial  weight  set  has  a 
great  impact  on  the  performance  of  the  BP  algorithm  (cf.  Section  4.7).  Also  BP 
fails  to  converge  occasionally.  Thus  it  is  reasonable  to  consider  combining  BP  with 
clustering  analysis. 


CHAPTER  7 
DETERMINISTIC  GLOBAL  ALGORITHMS 

Unless domain specific knowledge is properly implemented, most stochastic global
optimization algorithms are inefficient in determining global optimal solution(s),
although good solutions may be found with considerable effort. This is even more
true with deterministic approaches. There have been no generally applicable efficient
deterministic global optimization algorithms. However, many algorithms that
take advantage of domain specific knowledge have been developed in the last decade
(Horst and Tuy, 1990). The problem of optimal training of feedforward neural nets
has some features, such as Lipschitz continuity and a known lower bound on the
error function, that might be amenable to existing global optimization methods. In
this chapter we treat neural network training as a global optimization problem and
explore ways to solve the problem. Two general approaches, namely, branch and
bound and Lipschitz optimization, will be considered.

7.1      Branch  and  Bound 

Horst (1986) adapted the branch-and-bound (BB) techniques to global optimization
problems and developed a general branch-and-bound prototype algorithm that
encompasses a variety of other global optimization algorithms. Implementation of the
general BB algorithm needs to take into account the specific features of a particular
problem. In the following we first introduce the basic concepts of the BB approach.
Then we present the prototype algorithm and related convergence theorems.

7.1.1     Prototype Branch and Bound

Definition 7.1 (Partition) Let $M$ be a closed set in $R^s$ and let $I$ be a finite index set.
A set $\mathcal{M} = \{M_i \mid i \in I\}$ of closed subsets of $M$ is said to be a partition of $M$ if

$$M = \bigcup_{i \in I} M_i$$

and

$$M_i \cap M_j = \partial M_i \cap \partial M_j, \quad \forall i, j \in I,\ i \neq j$$

where $\partial M_i$ denotes the (relative) boundary of $M_i$.

Let $F : D \subset R^s \to R$ be the global function to be minimized, and let $M_i \in \mathcal{M}$.
We say $M_i$ is feasible if $M_i \cap D \neq \emptyset$, and $M_i$ is infeasible if $M_i \cap D = \emptyset$. If neither
can be established, $M_i$ is uncertain. A subset is active if it is feasible or uncertain.

Let $n$ be the iteration index of the BB algorithm. We use $\mathcal{M}_n$ to denote the
collection of active subsets, and $\alpha_n, \beta_n$ to denote the upper and lower bound of $F$,
respectively, at iteration $n$. Following the formalism of Horst and Tuy (1990, p. 114)
we rewrite the prototype BB algorithm below. The convention that infima and minima
taken over an empty set equal $+\infty$ is observed.

Algorithm BB (Prototype)

1.  INITIALIZATION:

•  Set $n = 0$.

•  Choose a relaxed feasible set $M_0 \supseteq D$, and a possibly empty feasible set
$S_{M_0} \subset D$. Set $\mathcal{M}_0 = \{M_0\}$, and find upper and lower bounds associated
with $M_0$. That is, $\alpha_0 = \alpha(M_0)$, $\beta_0 = \beta(M_0)$, satisfying

$$\beta_0 \le \min F(D) \le \alpha_0 = \min F(S_{M_0}).$$

•  If $\alpha_0 < \infty$, then choose the current best solution $x^0$
such that $F(x^0) = \alpha_0$.

•  If $\alpha_0 - \beta_0 = 0$, then stop. $x^0$ is a global optimal solution.
Otherwise, go to Step 2.

2.  RECURSION:

Set $n = n + 1$.

At the beginning of iteration $n$, the current partition $\mathcal{M}_{n-1}$ contains
all the active subsets $\{M_i \mid i \in I_{n-1}\}$.

For each $M_i$, $i \in I_{n-1}$, we have upper and lower bounds $\alpha(M_i)$ and $\beta(M_i)$
satisfying:

$\beta(M_i) \le \min F(M_i \cap D) \le \alpha(M_i) = \min F(S_{M_i})$ if $M_i$ is feasible and
$\beta(M_i) \le \min F(M_i)$ if $M_i$ is uncertain.

Moreover, we have the overall upper and lower bounds $\alpha_{n-1}, \beta_{n-1}$ satisfying

$\alpha_{n-1} = \min \alpha(M_i)$ for all $M_i \in \mathcal{M}_{n-1}$;

$\beta_{n-1} = \min \beta(M_i)$ for all $M_i \in \mathcal{M}_{n-1}$; and

$\beta_{n-1} \le \min F(D) \le \alpha_{n-1}$.

Finally, if $\alpha_{n-1} < \infty$, then we have the current best solution $x^{n-1}$
satisfying $F(x^{n-1}) = \alpha_{n-1}$.

2.1  Branching:

•  Let $\mathcal{R}_n$ be the collection of all subsets $M_i \in \mathcal{M}_{n-1}$ such that $\beta(M_i) < \alpha_{n-1}$,
i.e., retaining only subsets that are still of interest.

•  Select a nonempty collection of sets $\mathcal{P}_n \subset \mathcal{R}_n$ and partition
each member of $\mathcal{P}_n$. Let $\mathcal{P}'_n$ be the collection of all the newly formed subsets.

•  Let $\mathcal{M}'_n$ be the collection of all the subsets $M_i \in \mathcal{P}'_n$ such that $M_i$ is
active (feasible or uncertain).

2.2  Bounding:

•  For each $M_i \in \mathcal{M}'_n$, find:

a feasible subset $S_{M_i} \subset M_i \cap D$ (if possible);

$\beta(M_i) = \min F(M_i \cap D)$ if $M_i$ is feasible;

$\beta(M_i) = \min F(M_i)$ if $M_i$ is uncertain; and

$\alpha(M_i) = \min F(S_{M_i})$.

•  Set $\mathcal{M}_n = (\mathcal{R}_n \setminus \mathcal{P}_n) \cup \mathcal{M}'_n$, i.e., merge all subsets still of interest. Let

$\alpha_n = \min \alpha(M_i)$ for all $M_i \in \mathcal{M}_n$ and

$\beta_n = \min \beta(M_i)$ for all $M_i \in \mathcal{M}_n$.

•  Update the current solution:

If $\alpha_n < \infty$, let $x^n \in D$ such that $F(x^n) = \alpha_n$.

3.  STOPPING CONDITION:

If $\alpha_n - \beta_n = 0$, then stop. $\alpha_n = \beta_n = \min F(D)$.
$x^n$ is a global optimal solution.
Otherwise, go to Step 2.

Remarks:

1.  The prototype BB algorithm leaves many implementation details to be determined
by specific applications. The efficiency and convergence of the BB
procedure depend on three important questions: (1) how to carry out the partition;
(2) how to determine the bounds (bounding); and (3) how to choose the
collection of subsets $\mathcal{P}_n$ for further partitioning (branching).

2.  The stopping condition $\alpha_n - \beta_n = 0$ can be relaxed to $\alpha_n - \beta_n \le \epsilon$, where
$\epsilon \in R^+$ is small. $\epsilon$ is called the error tolerance.

3.  Since $\{\alpha_n\}$ is nonincreasing and $\{\beta_n\}$ is nondecreasing, the limits $\alpha = \lim_{n \to \infty} \alpha_n$
and $\beta = \lim_{n \to \infty} \beta_n$ exist. Also $\beta \le \min F(D) \le \alpha$ by the construction of the
algorithm.

4.  For neural network training, it is easy to find $S_M$ and hence $\alpha_n$. Specific
algorithms are needed to find the lower bound $\beta_n$. Lipschitz continuity may
be used in finding $\beta_n$. The existence of a known lower bound $\beta = 0$ (for
feedforward neural net with MSE criterion function) may be used in evaluating
$\alpha_n$ and hence provides an easily checked stopping condition.
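
To make the prototype concrete, here is a minimal sketch of one possible instantiation on an interval $[a, b]$. It is not the dissertation's algorithm: partition elements are subintervals split by bisection, the upper bound is the function value at the midpoint, the lower bound is that value minus $L$ times the half-width (which assumes a known Lipschitz constant `L`), and the element with the smallest lower bound is always branched. All names are hypothetical.

```python
import heapq
import math


def branch_and_bound_1d(F, a, b, L, eps=1e-3, max_iter=100000):
    """Best-first interval branch and bound with the relaxed stop alpha - beta <= eps."""
    mid = 0.5 * (a + b)
    alpha, x_best = F(mid), mid           # overall upper bound and incumbent
    # Heap of active intervals keyed by their lower bound beta(M).
    heap = [(alpha - L * 0.5 * (b - a), a, b)]
    beta = heap[0][0]
    for _ in range(max_iter):
        if not heap:
            beta = alpha                   # nothing left that could improve alpha
            break
        beta = heap[0][0]                  # overall lower bound beta_n
        if alpha - beta <= eps:
            break
        _, lo, hi = heapq.heappop(heap)    # branch on the best lower bound
        m = 0.5 * (lo + hi)
        for left, right in ((lo, m), (m, hi)):
            c = 0.5 * (left + right)
            fc = F(c)
            if fc < alpha:
                alpha, x_best = fc, c      # update the current best solution
            lower = fc - L * 0.5 * (right - left)
            if lower < alpha:              # keep only subsets still of interest
                heapq.heappush(heap, (lower, left, right))
    return x_best, alpha, beta


def demo_objective(x):
    """Toy multimodal function; |F'(x)| <= 3.5, so L = 3.5 is a valid constant."""
    return math.sin(3.0 * x) + 0.5 * x
```

Calling `branch_and_bound_1d(demo_objective, 0.0, 4.0, 3.5)` returns an incumbent, its value $\alpha_n$, and the final lower bound $\beta_n$. Selecting the element with the smallest lower bound at every step is a simple way to obtain a bound improving branching rule in the sense used later in this chapter.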

7.1.2     BB  Algorithm  Convergence 

In order to establish convergence results for BB based neural net training algorithms,
several definitions and convergence theorems due to Horst and Tuy (1990)
are presented below.


Definition 7.2 (Consistent Bounding) A bounding operation is consistent if at every
iteration $n$ any active partition element can be further refined, and if any decreasing
sequence $\{M_{n_q}\}$ of successively refined partition elements satisfies

$$\lim_{q \to \infty} (\alpha_{n_q} - \beta(M_{n_q})) = 0. \qquad (7.1)$$

If $\{M_{n_q}\}$ is finite, then the bounding operation is called finitely consistent.
Note that (7.1) is implied by the more easily checked condition

$$\lim_{q \to \infty} (\alpha(M_{n_q}) - \beta(M_{n_q})) = 0.$$

Definition 7.3 (Complete Branching) A branching operation is complete if in the end,
for every feasible partition element $M \in \bigcup_{p=1}^{\infty} \bigcap_{n=p}^{\infty} \mathcal{R}_n$, we have

$$\min F(M \cap D) \ge \alpha = \lim_{n \to \infty} \alpha_n.$$

That is, any unexplored feasible partition elements cannot contain a better solution.

Definition 7.4 (Bound Improving Branching) A branching operation is bound improving
if, at least each time after a finite number of iterations, one of the partition
elements with the best lower bound is selected for further partition. This requires

$$\mathcal{P}_n \cap \arg\min \{\beta(M_i) \mid M_i \in \mathcal{R}_n\} \neq \emptyset.$$

Note  that  if  the  bounding  operation  is  consistent,  then  a  bound  improving  branching 
operation  is  also  complete.  This  follows  from  the  definitions. 

Theorem 7.1 In an infinite branch and bound procedure, suppose the bounding operation
is consistent and the branching operation is complete, then

$$\alpha = \lim_{n \to \infty} \alpha_n = \lim_{n \to \infty} F(x^n) = \min F(D).$$


Corollary 7.1 Let $F : R^s \to R$ be continuous, $D$ be closed and $C(x^0) = \{x \in D \mid F(x) \le F(x^0)\}$
be bounded. Suppose the bounding operation is consistent and
the branching operation is complete in an infinite branch and bound procedure, then
every accumulation point $\bar{x}$ of $\{x^n\}$ satisfies

$$F(\bar{x}) = \min F(D).$$


Theorem 7.2 In an infinite branch and bound procedure, suppose the bounding operation
is consistent and the branching operation is bound improving, then the procedure is
convergent. We have

$$\alpha = \lim_{n \to \infty} \alpha_n = \lim_{n \to \infty} F(x^n) = \min F(D) = \lim_{n \to \infty} \beta_n = \beta.$$

7.2     Lipschitz Optimization

In this section we consider global optimization of a wide class of functions, the
Lipschitz functions. We start with an introduction to the characteristics of the univariate
Lipschitz optimization problem. A classic univariate global optimization algorithm,
Piyavskii's algorithm, is presented, which is to be expanded to deal with neural
network training. The convergence of Piyavskii's algorithm is discussed under
the general framework of branch and bound.

Definition 7.5 (Lipschitz function) A continuous function $F : M \to R$, $M \subset R^s$, is a
Lipschitz function if there exists a constant $L = L(F, M) > 0$ such that

$$|F(x) - F(y)| \le L\|x - y\|, \quad \forall x, y \in M$$

where $L$ is called a Lipschitz constant.

Knowing the Lipschitz constant of a function $F$ provides a way of computing
lower bounds on the global minimum of $F$. Suppose we want to minimize $F$ over $M$,
and let $\delta(M) = \max \{\|x - y\| \mid x, y \in M\}$ be the diameter of $M$. From the definition of
a Lipschitz function, we have

$$F(y) \ge F(x) - L\|x - y\| \ge F(x) - L\,\delta(M), \quad \forall x, y \in M.$$

If $F(x)$ is known for some $x \in M$, then $F(x) - L\,\delta(M)$ gives a lower bound to the
global minimum of $F$ over $M$.
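
As a small worked illustration of this bound, the sketch below computes the diameter of an axis-aligned box and the resulting lower bound from one function evaluation. The box shape and both function names are assumptions of the example; the text only requires a set $M$ with a known diameter.

```python
import math


def box_diameter(lower, upper):
    """Euclidean diameter of the box [lower_1, upper_1] x ... x [lower_s, upper_s]."""
    return math.sqrt(sum((u - l) ** 2 for l, u in zip(lower, upper)))


def lipschitz_lower_bound(f_x, L, diameter):
    """Lower bound F(x) - L * delta(M) on the minimum of F over M."""
    return f_x - L * diameter
```

For example, $F(w) = |w_1| + |w_2|$ has Lipschitz constant $\sqrt{2}$ with respect to the Euclidean norm. On the box $[-1, 1]^2$ the diameter is $2\sqrt{2}$, so one evaluation at the corner $(1, 1)$, where $F = 2$, yields the lower bound $2 - \sqrt{2} \cdot 2\sqrt{2} = -2$, which is valid but loose compared with the true minimum 0.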

A univariate Lipschitz (UL) optimization problem is to find an optimal value of
$F$ over $M$, where $M = [a, b]$ is an interval in $R$. Let $\Phi_L[a, b]$ denote the class of
all Lipschitz functions on $[a, b]$ with Lipschitz constant $L$. The following theorem
(Hansen et al., 1989) shows that it is extremely difficult to find an exact global
optimal solution of problem (UL).

Theorem 7.3 There is no algorithm that solves every problem (UL) in $\Phi_L[a, b]$ in a
finite number of iterations.

Proof:

To the contrary, assume that there is a finite convergent algorithm LA that solves a
(UL) problem in $k$ steps. We have $F(x^*) = \min F(x^i)$, for $i = 1, 2, \ldots, k$, $k \ge 1$.

Figure 7.1. Univariate Lipschitz optimization

Denote $X_k = \{x^1, x^2, \ldots, x^k\}$ and let $F(X_k)$ be the set of corresponding function
values. Let $x^j$ be the evaluation point closest to $x^*$ on the left (if there is no such
point, choose $x^j$ on the right of $x^*$).
Consider

$$\tilde{F}(x) = \begin{cases} \max \{F(x^j) - L(x - x^j),\ F(x^*) - L(x^* - x)\} & \text{if } x \in [x^j, x^*] \\ F(x) & \text{otherwise.} \end{cases}$$

Clearly, $\tilde{F} \in \Phi_L[a, b]$, since $|\tilde{F}'| \le L$. $\tilde{F}$ attains its minimum at

$$\tilde{x} = \frac{x^* + x^j}{2} + \frac{F(x^j) - F(x^*)}{2L}$$

using a geometric argument as shown in Figure 7.1. We now have

$$\tilde{F}(\tilde{x}) = \tfrac{1}{2}\,[F(x^*) + F(x^j) - L(x^* - x^j)] < F(x^*).$$

The strategy of algorithm LA depends only on $L$, $X_k$, and $F(X_k)$, which coincide
for $F$ and $\tilde{F}$. Hence

$$F(x^*) = \min F(x^j) = \min \tilde{F}(x^j), \quad j = 1, 2, \ldots, k.$$

Algorithm LA concludes $x^*$ is also a global minimizer for $\tilde{F}(x)$, which is a contradiction.
This proves the theorem. □

A variation of the result due to Hansen et al. (1989) states that: There is no
algorithm with finite convergence for (UL) unless the function $F$ satisfies the following
condition

$$\exists\, \delta > 0 \text{ such that } F(x) = F(x^*) + L\,|x - x^*|, \quad \forall x \in [x^* - \delta, x^* + \delta] \cap [a, b].$$

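Theorem 7.4 Let $\epsilon > 0$ be a real number. Every problem (UL$_\epsilon$) can be solved by a
finite algorithm, where problem (UL$_\epsilon$) is given by

Problem (UL$_\epsilon$)

Find $x_\epsilon \in [a, b]$ such that $F_\epsilon = F(x_\epsilon) \le F^* + \epsilon$.

Proof:

Evaluate $F$ at the equidistant points

$$x^i = a + \frac{(2i - 1)\,\epsilon}{L}, \quad i = 1, 2, \ldots, k$$

where $k = \lceil L(b - a)/(2\epsilon) \rceil$. Suppose $x^* \in [x^i, x^{i+1}]$, then

$$|F(x^i) - F(x^*)| \le L\,|x^i - x^*| \le L\,|x^{i+1} - x^i|.$$

Consecutive points are $2\epsilon/L$ apart, and the first and last points lie within $\epsilon/L$ of
$a$ and $b$, so the evaluation point nearest to $x^*$ is at distance at most $\epsilon/L$ and its
function value exceeds $F^*$ by at most $\epsilon$. □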
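
The grid construction used in the proof can be sketched directly. This is an illustration, not code from the source: the function name is hypothetical, and clipping the last grid point to $b$ (it can overshoot by up to $\epsilon/L$) is an added safeguard.

```python
import math


def solve_ul_eps(F, a, b, L, eps):
    """Evaluate F on the grid a + (2i - 1) * eps / L and return the best point.

    Assumes a < b, L > 0 is a valid Lipschitz constant of F on [a, b], eps > 0.
    """
    k = math.ceil(L * (b - a) / (2.0 * eps))
    best_x, best_f = None, math.inf
    for i in range(1, k + 1):
        # Clip to b: the last grid point may fall slightly outside [a, b].
        x = min(a + (2 * i - 1) * eps / L, b)
        fx = F(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f, k


def demo_objective(x):
    """Toy function with |F'(x)| <= 3.5 on the whole real line."""
    return math.sin(3.0 * x) + 0.5 * x
```

The number of evaluations grows like $L(b - a)/(2\epsilon)$, which is why this passive grid is mainly of theoretical interest and adaptive schemes such as Piyavskii's algorithm are preferred in practice.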
Many algorithms for solving problem (UL$_\epsilon$) exist. Hansen et al. (1989) present
a nice survey on those algorithms. A classic approach for solving (UL$_\epsilon$) is
Piyavskii's (1972) algorithm, which iteratively constructs a piecewise linear underestimate
function of $F$

$$\phi_n(x) = \max \{F(x^i) - L\,|x - x^i| \mid i = 1, 2, \ldots, n\}.$$

$\phi_n$ is called a saw-tooth cover of $F$, because of its shape. Piyavskii's algorithm
is outlined below.

Figure 7.2. Saw-tooth construction by Piyavskii's algorithm


Algorithm (Piyavskii)

1.  INITIALIZATION:

Set $n = 1$, $x^1 = \frac{a + b}{2}$, $x_\epsilon = x^1$, and $F_\epsilon = F(x^1)$. Also set
$\phi_\epsilon = F_\epsilon - \frac{L(b - a)}{2}$, and
$\phi_1(x) = F(x^1) - L\,|x - x^1|$.

2.  RECURSION:

•  Check the quality of the current solution $F_\epsilon$.
If $F_\epsilon - \phi_\epsilon \le \epsilon$, then stop.

Otherwise find $x^{n+1} \in \arg\min \phi_n([a, b])$.

•  Update the upper bound.

If $F(x^{n+1}) < F_\epsilon$, then set $F_\epsilon = F(x^{n+1})$, and $x_\epsilon = x^{n+1}$.

•  Update the lower bound.

Set $\phi_{n+1}(x) = \max \{F(x^i) - L\,|x - x^i| \mid i = 1, 2, \ldots, n + 1\}$, and
$\phi_\epsilon = \min \phi_{n+1}([a, b])$.

•  $n = n + 1$. Go to RECURSION.
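
A straightforward sketch of the saw-tooth procedure is given below. It is illustrative only: the names are hypothetical, the cover is re-minimized by scanning all adjacent pairs of evaluation points at each step (quadratic cost overall), and an iteration cap is added. Between adjacent points $x^l < x^r$ the cover attains its local minimum $\frac{1}{2}[F(x^l) + F(x^r)] - \frac{L}{2}(x^r - x^l)$ at $\frac{1}{2}(x^l + x^r) + \frac{F(x^l) - F(x^r)}{2L}$.

```python
import bisect
import math


def piyavskii(F, a, b, L, eps=1e-3, max_iter=10000):
    """Piyavskii's saw-tooth algorithm on [a, b] with Lipschitz constant L."""
    x1 = 0.5 * (a + b)
    pts = [(x1, F(x1))]                    # evaluation points, sorted by x
    x_best, f_best = pts[0]
    phi_min = -math.inf
    for _ in range(max_iter):
        # Candidate minimizers of the saw-tooth cover: (value, location).
        cands = []
        if pts[0][0] > a:
            cands.append((pts[0][1] - L * (pts[0][0] - a), a))
        if pts[-1][0] < b:
            cands.append((pts[-1][1] - L * (b - pts[-1][0]), b))
        for (xl, fl), (xr, fr) in zip(pts, pts[1:]):
            x_tooth = 0.5 * (xl + xr) + (fl - fr) / (2.0 * L)
            phi = 0.5 * (fl + fr) - 0.5 * L * (xr - xl)
            cands.append((phi, x_tooth))
        phi_min, x_next = min(cands)       # lower bound and next evaluation point
        if f_best - phi_min <= eps:
            break
        f_next = F(x_next)
        bisect.insort(pts, (x_next, f_next))
        if f_next < f_best:                # update the upper bound
            x_best, f_best = x_next, f_next
    return x_best, f_best, phi_min


def demo_objective(x):
    """Toy function with |F'(x)| <= 3.5, so L = 3.5 is valid."""
    return math.sin(3.0 * x) + 0.5 * x
```

`piyavskii(demo_objective, 0.0, 4.0, 3.5)` returns the incumbent, its value (the upper bound), and the minimum of the final saw-tooth cover (the lower bound).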

The procedure of Piyavskii's algorithm is illustrated in Figure 7.2. Note that if
$\epsilon = 0$, the algorithm converges to a global optimal solution. Thus Piyavskii's
algorithm also solves problem (UL). As indicated by the labeling of recursion steps,
Piyavskii's algorithm can be viewed as a branch and bound algorithm. At each step $n$
a lower bound point $x^{n+1}$ is chosen from $\arg\min \phi_n([a, b])$. Thus, in the terminology
of branch and bound, the branching operation is bound improving. It was shown
in Horst and Tuy (1990, p. 602-604) that the bounding operation is consistent.
Following Theorem 7.1 and Corollary 7.1, the following proposition holds.

Proposition 7.1 Piyavskii's algorithm is convergent for solving problem (UL). In
particular, under the branch and bound framework, we have

$$\lim_{n \to \infty} \beta_n = \min F([a, b]) = \lim_{n \to \infty} \alpha_n$$

and every accumulation point of $\{x^n\}$ is an optimal solution of problem (UL).

7.3     Estimate the Lipschitz Constant for an FNN

With  standard  sigmoid  activation  function  and  linear  transfer  function,  an  FNN 
is  equivalent  to  a  composition  of  continuous  functions.  In  the  following,  we  will  show 
that  a  standard  FNN  is  Lipschitz  continuous  by  deriving  bounds  on  the  Lipschitz 
constant.  Knowing  the  Lipschitz  constant  enables  us  to  obtain  computable  lower 
bounds  for  minimizing  the  error  function  of  an  FNN. 

7.3.1     Some  Lemmas  on  Lipschitz  Constant 

The following lemmas are needed before we show the propositions that give easily
computable bounds on the Lipschitz constant for an FNN.

Lemma 7.1 Let $f_i : R^n \to R$, $i = 1, 2, \ldots, k$, be Lipschitzian with Lipschitz constants
$L_i$, respectively. Then $F : R^n \to R$, given by $F = \sum_i f_i$, is also Lipschitzian, and the
Lipschitz constant of $F$ is given by $L_F = \sum_i L_i$.

Proof:

Pick any $x, y \in S \subset R^n$, we have

$$|F(x) - F(y)| = \Big|\sum_{i=1}^{k} f_i(x) - \sum_{i=1}^{k} f_i(y)\Big| \le \sum_{i=1}^{k} |f_i(x) - f_i(y)| \le \sum_{i=1}^{k} L_i\, \|x - y\|. \qquad (7.2)$$

□

Lemma 7.2 Let $x \in R^n$, and $F(x) = f(g(x))$, where $f : R \to R$, $g : R^n \to R$
are Lipschitzian with Lipschitz constants $L_f$ and $L_g$, respectively. Then $F(x)$ has a
Lipschitz constant $L_F$ given by $L_F = L_f L_g$.

Proof:

Pick any $x, y \in R^n$ and any $s, t \in R$. By definition we have

$$|g(x) - g(y)| \le L_g \|x - y\|$$

and

$$|f(s) - f(t)| \le L_f |s - t|.$$

Thus

$$|F(x) - F(y)| = |f(g(x)) - f(g(y))| \le L_f\, |g(x) - g(y)| \le L_f L_g\, \|x - y\|. \qquad (7.3)$$

□


Definition 7.6 ($l_p$ norm) Let $x \in R^n$, the $l_p$ norm on $R^n$ is a function $f : R^n \to R$
such that

$$f(x) = \|x\|_p = \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p}.$$

Lemma 7.3 Let $x \in R^n$, the $l_p$ norm on $R^n$, for $1 \le p \le \infty$, satisfies

$$\|x\|_p \le \|x\|_1 = \sum_{i=1}^{n} |x_i|.$$

Proof:

Pick any $p$ from $I = \{1, 2, \ldots, \infty\}$, we have

$$\|x\|_1 = \Big[\Big(\sum_{i=1}^{n} |x_i|\Big)^p\Big]^{1/p} \ge \Big[\Big(\sum_{i=1}^{n-1} |x_i|\Big)^p + |x_n|^p\Big]^{1/p} \ge \big[|x_1|^p + |x_2|^p + \cdots + |x_n|^p\big]^{1/p} = \|x\|_p. \qquad (7.4)$$

The third step and what follows it are obtained by applying the binomial expansion.
Let $a, b \in R$ with $a, b \ge 0$, and let $k > 0$ be an integer, then the following holds

$$(a + b)^k = a^k + k a^{k-1} b + \frac{k(k-1)}{2} a^{k-2} b^2 + \cdots + k a b^{k-1} + b^k \ge a^k + b^k. \qquad (7.5)$$

□

Lemma 7.4 Let $x \in R^n$, and $F(x) = f(g(x))$, where $f : R^m \to R$ is Lipschitzian
with Lipschitz constant $L_f$, $g : R^n \to R^m$ with components $g_i$, $i = 1, 2, \ldots, m$, being
Lipschitzian with Lipschitz constants $L_{g_i}$. Then $F(x)$ has a Lipschitz constant $L_F$
given by $L_F = L_f \sum_{i=1}^{m} L_{g_i}$.

Proof:

Let $x, y \in R^n$, we have

$$|F(x) - F(y)| = |f(g(x)) - f(g(y))| \le L_f\, \|g(x) - g(y)\|. \qquad (7.6)$$

By Lemma 7.3, for $1 \le p \le \infty$, and $z \in R^m$, the $l_p$ norm satisfies

$$\|z\|_p \le \|z\|_1 = \sum_{i=1}^{m} |z_i|. \qquad (7.7)$$

Thus, (7.6) becomes

$$|F(x) - F(y)| = |f(g(x)) - f(g(y))| \le L_f\, \|g(x) - g(y)\| \le L_f \sum_{i=1}^{m} |g_i(x) - g_i(y)| \le L_f \sum_{i=1}^{m} L_{g_i} \|x - y\| = \Big(L_f \sum_{i=1}^{m} L_{g_i}\Big) \|x - y\|. \qquad (7.8)$$

7.3.2     An  FNN  is  Lipschitzian 

Let $f(W, X)$ be the mapping representing an FNN, where $W$ is the weight set
(bounded) and $X$ is the training set. Since $f(W, X)$ is continuously differentiable,
$f(W, X)$ is Lipschitzian if its gradient is bounded. That is, if

$$L = \max \{\|\nabla_w f(w, x)\| \mid w \in W\}$$

is finite.

Proposition 7.2 The standard sum of squared error (SSE) criterion function of a feedforward
neural network with linear transfer function and sigmoid activation function
is Lipschitz continuous if the weight set is bounded.

Proof:

Consider a 3-layer feedforward neural net. Let $i, j, k$ be the indices for the neurons
in input layer, hidden layer, and output layer, respectively (cf. Figure 3.6). We have

$$F = \sum_{p=1}^{P} F_p = \frac{1}{2} \sum_{p=1}^{P} \sum_{k=1}^{K} (t_{pk} - o_{pk}(w, x))^2, \quad \forall w \in W$$

where $p$ is the index for training patterns, and $o_{pk}(w, X)$ is the network output at
output unit $k$. For any given pattern $p$, we have

$$F_p = \frac{1}{2} \sum_k (t_k - o_k)^2. \qquad (7.9)$$

Note that

$$o_k = f\Big(\sum_j w_{jk}\, f\Big(\sum_i w_{ij} x_i\Big)\Big) \qquad (7.10)$$

where $f$ is the sigmoid function. Taking the partial derivative of $F_p$ w.r.t. $w_q$ gives

$$\frac{\partial F_p}{\partial w_q} = -\sum_k (t_k - o_k) \frac{\partial o_k}{\partial w_q}. \qquad (7.11)$$

If $w_q \in W_O$ (an output layer weight), then

$$\Big|\frac{\partial F_p}{\partial w_q}\Big| = \Big|(t_k - o_k)\, \gamma\, o_k (1 - o_k)\, f\Big(\sum_i w_{ij} x_i\Big)\Big| \le \frac{\gamma}{4}. \qquad (7.12)$$

This is because the sigmoid function $f$ has a range of $(0, 1)$; $o_k(1 - o_k)$ is maximal at
$o_k = 1/2$; and $|t_k - o_k| < 1$.

If $w_q \in W_H$ (a hidden layer weight), then

$$\Big|\frac{\partial F_p}{\partial w_q}\Big| = \Big|\sum_k (t_k - o_k)\, \gamma\, o_k (1 - o_k)\, w_{jk}\, \gamma\, o_j (1 - o_j)\, x_i\Big| \le \frac{\gamma^2}{16} \sum_k |w_{jk}| \qquad (7.13)$$

since the input $x_i \in [0, 1]$. Combining (7.12) and (7.13) we have

$$\Big|\frac{\partial F_p}{\partial w_q}\Big| \le \max \Big\{\frac{\gamma}{4},\ \frac{\gamma^2}{16} \sum_k |w_{jk}|\Big\} = L_q. \qquad (7.14)$$

Let $L_p = \max \|\nabla F_p\|$. Then we have

$$L_p = \Big(\sum_q L_q^2\Big)^{1/2}. \qquad (7.15)$$

By applying Lemma 7.1, the Lipschitz constant of the FNN is bounded by

$$L(F) \le n_p L_p \qquad (7.16)$$

where $n_p$ is the number of training patterns. Extension of the boundedness of $L(F)$
to neural nets with more than one hidden layer is straightforward. □




7.3.3     Local  Lipschitz  Constant 

By the results from Section 7.3.2, an upper bound, say $L'$, of the Lipschitz constant
for a given FNN can be estimated with given bounds on the weights. However,
the usage of this bound is limited to FNNs with known bounded weight sets. Furthermore,
it is usually too loose to be practically useful in obtaining good lower
bounds for the error minimization problem. By exploiting the special properties of
the structure and the activation functions in an FNN, we are able to derive bounds
on Lipschitz constants that do not explicitly depend on the weight set. More importantly,
this approach allows us to estimate Lipschitz constants on subsets of the
weight space, and hence makes it possible to obtain good Lipschitz constants and
tight lower bounds on the global function in the subregions of the weight space.

In the following discussion, we assume the standard sigmoid activation function $f$
(with range (0, 1.0)) is used. The transfer function is a linear function of the inputs
from the previous layer with a constant term (the bias). $L$ is used to denote a Lipschitz
constant with subscripts identifying the corresponding functions. Let us consider first
the case where the neural network has $n$ input units and a single output unit.

No  Hidden  Layer 

Let $o$ be the output of the network. For an FNN without a hidden layer, $o$ is
given by

$$o = f(w, x) = f\Big(\sum_{i=1}^{n} w_i x_i + w_0\Big) = \frac{1}{1 + \exp\big(-\gamma (\sum_{i=1}^{n} w_i x_i + w_0)\big)}$$

for any input pattern $x \in R^n$. Let $F$ be the evaluation function of the neural network.
If $F$ is the sum-of-squared error function, we have

$$F = \sum_{p=1}^{P} F_p = \frac{1}{2} \sum_{p=1}^{P} (t_p - o_p)^2.$$

For each pattern $p$, by Lemma 7.2 we have $L_{F_p} = L_s L_o$, where $L_s$ is the Lipschitz
constant of the sum-of-squared error function, and $L_o$ is the Lipschitz constant of
the network output function.¹ To estimate a Lipschitz constant of $f$ as a function of
weights $w$, we use

$$L_o = \max \|\nabla_w f(w, x)\|, \quad \forall w \in W.$$

¹ For simplicity, we omit the subscripts $p$ for $L_s$ and $L_o$, with the understanding that they also
depend on the input pattern $x_p$.

Note the maximum exists, as we can show that $\|\nabla_w f(w, x)\|$ is finite. For a given
input $x$ (adding $x_0 = 1$), we have

$$\frac{\partial o}{\partial w_i} = \gamma\, o (1 - o)\, x_i, \quad i = 0, 1, 2, \ldots, n.$$

Thus $L_o$ can be obtained by

$$L_o = \max\ \gamma\, o (1 - o) \Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2}. \qquad (7.17)$$

$L_s$ can be obtained by

$$L_s = \max \|\nabla_o F_p\| = \max |t_p - o_p(w, x_p)|, \quad \forall w \in W. \qquad (7.18)$$

Applying Lemma 7.1 and combining the above, the Lipschitz constant for an FNN
without a hidden layer is given by

$$L_F = \sum_{p=1}^{P} L_{F_p} = \sum_{p=1}^{P} L_s L_o = \sum_{p=1}^{P} \big[\max |t_p - o_p|\big] \Big[\max\ \gamma\, o_p (1 - o_p) \Big(1 + \sum_{i=1}^{n} x_{pi}^2\Big)^{1/2}\Big], \quad \forall w \in W. \qquad (7.19)$$
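
A numerical sketch of (7.19) for the no-hidden-layer case is given below. It is an illustration under stated assumptions rather than the dissertation's procedure: the maxima over $w$ are replaced by their worst-case values over the whole weight space, namely $o(1 - o) \le 1/4$ and $|t_p - o_p| \le \max(t_p, 1 - t_p)$ for targets in $[0, 1]$, so the result is a global upper bound rather than a local constant. All function names are hypothetical.

```python
import math


def fnn_output(w, x, gamma=1.0):
    """Output of an FNN without a hidden layer; w[0] is the bias w_0."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-gamma * net))


def sse(w, X, T, gamma=1.0):
    """Sum-of-squared error F = 1/2 * sum_p (t_p - o_p)^2."""
    return 0.5 * sum((t - fnn_output(w, x, gamma)) ** 2 for x, t in zip(X, T))


def lipschitz_bound_no_hidden(X, T, gamma=1.0):
    """Worst-case version of (7.19): sum over patterns of L_s * L_o."""
    total = 0.0
    for x, t in zip(X, T):
        # L_s: sup of |t - o| over o in (0, 1), assuming t in [0, 1].
        l_s = max(t, 1.0 - t)
        # L_o: gamma * max o(1 - o) * sqrt(1 + sum x_i^2), with o(1 - o) <= 1/4.
        l_o = 0.25 * gamma * math.sqrt(1.0 + sum(xi * xi for xi in x))
        total += l_s * l_o
    return total
```

For any two weight vectors $w^1, w^2$, the value returned by `lipschitz_bound_no_hidden` times $\|w^1 - w^2\|$ bounds $|F(w^1) - F(w^2)|$. Restricting the weights to a subregion would allow the two worst-case factors to be tightened, which is the idea behind the local constants discussed in this section.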

One  Hidden  Layer 

For a single-output FNN with one hidden layer, the output of the network is

$$o = f(w, x) = f\Big(\sum_{j=1}^{h} w_j f_j\Big(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\Big) + w_0\Big)$$

where h is the number of hidden units, and the $f_j$'s are the activation functions in the hidden layer. Note that the output o can be written as a composite function $o = f(g(w, x))$, where

$$g(w, x) = \sum_{j=1}^{h} w_j f_j\Big(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\Big) + w_0.$$

Applying Lemma 7.2, we have

$$L_o = L_f L_g.$$

$L_f$ is given by

$$L_f = \max \|\nabla_g f(w, x)\| = \max \gamma f(1 - f), \quad \forall w \in W. \tag{7.20}$$

The function g can be rewritten as $g_o(f^H)$, where $g_o : R^h \to R$ transfers the hidden layer output to the output layer input. $g_o$ can be written as

$$g_o = W_H f^H + w_0$$

where $W_H$ is the set of weights between the hidden layer and the output layer, and $w_0$ is the output unit bias. $f^H : R^n \to R^h$ maps the output of the input layer to the output of the hidden layer. The components of $f^H$ are given by

$$f_j^H = f_j\Big(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\Big), \quad j = 1, 2, \ldots, h.$$

Applying Lemma 7.4, we have

$$L_g = L_{g_o} \sum_{j=1}^{h} L_{f_j^H}$$

where $L_{g_o}$ is given by

$$L_{g_o} = \max \|\nabla_f g_o\| = \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2}. \tag{7.21}$$

Note that $f_j^H$ in the hidden layer is equivalent to the output function of an FNN without a hidden layer. Hence, $L_{f_j^H}$ can be estimated using equation 7.17. We have

$$L_{f_j^H} = \max \gamma f_j(1 - f_j)\Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2}.$$

Putting the above together, we have, for a single hidden layer FNN,

$$L_o = \max \gamma f(1 - f)\; \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2} \sum_{j=1}^{h} \max \gamma f_j(1 - f_j)\Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2} \tag{7.22}$$

and

$$L_{F_p} = \max |t_p - o_p|\, L_{o_p}, \quad \forall w \in W,$$

where $L_{o_p}$ is given by equation 7.22 with the input $x_p$. We observe that f and the $f_j$'s are functions of the weights and the maximization is taken over the whole weight space, although, with the layered structure, the $f_j$'s depend only on hidden layer weights. Recall that $L_F = \sum_p L_{F_p}$; thus we have developed a procedure for estimating the Lipschitz constant for FNNs with a single output unit and a single hidden layer.

Multiple  Output  Units 

Equation 7.22 can be used in estimating the Lipschitz constant for a general three layer FNN (which is the most widely used NN structure). Let k be the index for the output processing units; then for each output unit $o_k$, we have

$$L_{o_k} = \max \gamma f_k(1 - f_k)\; \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2} \sum_{j=1}^{h} \max \gamma f_j(1 - f_j)\Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2}. \tag{7.23}$$

Consider the criterion function

$$F = \sum_{p=1}^{P} F_p = \frac{1}{2}\sum_{p=1}^{P}\sum_{k=1}^{K} (t_{pk} - o_{pk})^2.$$

For each training pattern p, $F_p = f_s(f_o)$, where $f_s : R^K \to R$ maps the network output to a performance measure, and $f_o : R^h \to R^K$ maps the hidden layer output to the input to the output layer. Observe that each component of $f_o$ is equivalent to the output function of a three layer FNN with a single output, the case discussed in the above subsection. Let $o_k$, k = 1, 2, ..., K, denote the component functions of $f_o$; $o_k$ is Lipschitzian with Lipschitz constant $L_{o_k}$ given by equation 7.23. By Lemma 7.4, the Lipschitz constant for $F_p$ is

$$L_{F_p} = L_{f_s} \sum_{k=1}^{K} L_{o_k}$$

where $L_{f_s}$ is given by

$$L_{f_s} = \max \|\nabla_o F_p\| = \max \Big(\sum_{k=1}^{K} (t_{pk} - o_{pk})^2\Big)^{1/2}. \tag{7.24}$$

Thus for the criterion function F, we have a Lipschitz constant (using Lemma 7.1 again)

$$L_F = \sum_{p=1}^{P} \max \Big(\sum_{k=1}^{K} (t_{pk} - o_{pk})^2\Big)^{1/2} \sum_{k=1}^{K} L_{o_k}. \tag{7.25}$$

This leads to the following proposition.

Proposition  7.3  The  criterion  function  representing  a  three  layered  feedforward  neural 
network  is  Lipschitzian  with  a  Lipschitz  constant  given  by  equation  7.25. 

Extension of the procedure to estimating the Lipschitz constant for an FNN with more than one hidden layer can be carried out by applying the basic lemmas recursively, as illustrated above.

A global estimate of $L_F$ can be obtained by noting that both the input x and the activation function assume values in [0, 1], and f(1 - f) has a maximum value of 1/4. Thus for a three layer FNN,

$$L_F \le \frac{1}{16} P K \sqrt{K}\, \gamma^2 \sqrt{1 + h}\; h \sqrt{1 + n} = \frac{1}{16} P K^{3/2} \gamma^2 h \sqrt{(1 + h)(1 + n)}. \tag{7.26}$$

Much  tighter  estimates  for  the  Lipschitz  constant  can  be  obtained  by  computing  it 
over  subsets  of  the  weight  space.  The  computation  involves  performing  maximization 
on  the  weight  subsets.  It  turns  out  that  with  the  special  structure  and  the  sigmoid 
activation  functions  of  a  standard  FNN,  the  maximization  can  be  easily  implemented 
by dividing the weight space into hyper-rectangles. In this case, it suffices to consider
only  the  lower  and  upper  vertices  of  a  hyper-rectangle  in  determining  the  Lipschitz 
constant  over  the  weight  subset.  Implementation  of  this  procedure  is  discussed  in  the 
next  chapter. 
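
As a quick numerical illustration of equation 7.26, the following Python sketch evaluates the global bound from the network dimensions alone. The function and argument names are ours, chosen for illustration; they are not taken from the GOTA source code.

```python
import math


def global_lipschitz_bound(P, K, h, n, gamma=1.0):
    """Worst-case bound of equation 7.26.

    P: number of training patterns, K: output units,
    h: hidden units, n: input units, gamma: sigmoid gain.
    """
    return P * K ** 1.5 * gamma ** 2 * h * math.sqrt((1 + h) * (1 + n)) / 16.0


# A 2x2x1 network trained on four patterns with gamma = 1.
bound_2x2x1 = global_lipschitz_bound(P=4, K=1, h=2, n=2)
```

Because the bound assumes every input component equals 1, it is looser than an estimate that uses the actual training inputs.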

7.4     BB  Based  NN  Training  Algorithm 

We have shown that the feedforward neural network training problem can be treated as a Lipschitz global optimization problem. In the following, the procedures discussed in the last three sections will be combined to construct a branch and bound based neural network training algorithm (BBBNNTA). Since the acronym BBBNNTA is unwieldy, we will simply call the procedure the global optimal training algorithm, or GOTA for short.

A Lipschitz constant L' for a given FNN can be estimated with the procedure discussed above. L' can be used in Piyavskii's algorithm. An extension of Piyavskii's algorithm to an S-dimensional rectangle D is trivial, but the resulting computational complexity is prohibitive. The direct extension requires computation of order $O(n^S)$, where n is the number of evaluation points in the one dimensional algorithm.

We adapt a diagonal extension of Piyavskii's algorithm by Horst and Tuy (1990). Note that for commonly used neural networks with the SSE criterion function, a natural lower bound $\beta = 0$ exists. This lower bound can be used in two ways. (1) At each iteration, the current best solution $w_n$ can be easily evaluated for its quality, as measured by $|F(w_n) - 0|$. (2) Any subregion with a lower bound larger than $\beta + \epsilon$ can be ignored, as such a subregion does not contain a solution w such that $F(w) \le F(w^*) + \epsilon$.

Let $W = \{w \in R^S \mid a \le w \le b\}$, where the inequalities are understood to mean componentwise comparison. We will assume the partition operation on W generates a series of hyper-rectangles. $a_M$ and $b_M$ are used to denote the lower left and upper right vertex of a partition element M, respectively. We list the global optimal training algorithm below.

Algorithm GOTA

1. INITIALIZATION:

• Set n = 0.

• Set $M_0 = W$, and $\mathcal{M}_0 = \{M_0\}$.

• Find upper and lower bounds associated with $M_0$:

$$\alpha_0 = \min\{F(a), F(b)\},$$

$$\beta_0 = \max\{F(a), F(b)\} - L'\|b - a\|$$

where L' is an estimate of L(F), the Lipschitz constant.

2. RECURSION:

Set n = n + 1.

At the beginning of iteration n, the current partition $\mathcal{M}_{n-1}$ contains all the active subsets $\{M_i \mid i \in I_{n-1}\}$. For each $M_i$, $i \in I_{n-1}$, we have upper and lower bounds $\alpha(M_i)$ and $\beta(M_i)$ satisfying

$$\beta(M_i) \le \min F(M_i) \le \alpha(M_i),$$

and current bounds

$$\beta_{n-1} \le \min F(W) \le \alpha_{n-1}.$$

2.1 Branching:

• Let $\mathcal{R}_n$ be the collection of all subsets $M_i \in \mathcal{M}_{n-1}$ such that $\beta(M_i) < \alpha_{n-1}$, i.e., retaining only subsets that are still of interest.

• Select a nonempty collection of sets $\mathcal{P}_n \subset \mathcal{R}_n$ such that

$$\mathcal{P}_n \cap \arg\min\{\beta(M_i) \mid M_i \in \mathcal{R}_n\} \ne \emptyset.$$

For each member $M \in \mathcal{P}_n$, choose

$$w_M = \tfrac{1}{2}(a_M + b_M) + \frac{(F(a_M) - F(b_M))(b_M - a_M)}{2L'\|b_M - a_M\|}.$$

$w_M$ is a point on the diagonal line of the hyper-rectangle M, biased towards the end point with the lower function value. Divide M into two hyper-rectangles with the dividing hyperplane passing through $w_M$ and orthogonal to the longest edges of M.

• Let $\mathcal{M}'_n$ be the collection of all new hyper-rectangles, and let $M \in \mathcal{P}_n$ denote the parent hyper-rectangle of $M' \in \mathcal{M}'_n$.

2.2 Bounding:

• For each $M' \in \mathcal{M}'_n$, find:

$$\alpha(M') = \min\{F(a_{M'}), F(b_{M'}), F(w_{M'})\},$$ and

$$\beta(M') = \max\{\beta(M),\ \max\{F(a_{M'}), F(b_{M'}), F(w_{M'})\} - L'\|b_{M'} - a_{M'}\|\}.$$

• Set $\mathcal{M}_n = (\mathcal{R}_n \setminus \mathcal{P}_n) \cup \mathcal{M}'_n$, i.e., merge all subsets still of interest. Let

$$\alpha_n = \min \alpha(M_i) \text{ for all } M_i \in \mathcal{M}_n,$$ and

$$\beta_n = \min \beta(M_i) \text{ for all } M_i \in \mathcal{M}_n.$$

• Update the current solution: let $w_n \in W$ be such that $F(w_n) = \alpha_n$.

3. STOPPING CONDITION:

If $\alpha_n \le \epsilon$, then stop; $w_n$ is a satisfactory solution. Otherwise, go to Step 2.


The bounding operation of GOTA is consistent and the branching operation is bound improving. Hence, following Theorem 7.2, the procedure is convergent. That is, when $\epsilon = 0$, we have

$$\lim_{n \to \infty} \alpha_n = \min F(W) = \lim_{n \to \infty} \beta_n,$$

and every accumulation point $w^*$ of $\{w_n\}$ satisfies

$$F(w^*) = \min F(W).$$
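
The steps of Algorithm GOTA can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the C++ implementation of Appendix A: it uses one fixed constant L for every box instead of local Lipschitz constants, it always selects the single box with the smallest lower bound, it stops when the best value found is at most eps (relying on the natural lower bound 0), and it clamps the diagonal shift so that both children keep a positive width. The diagonal point is shifted towards the vertex with the lower function value. The function name `gota`, the toy objective and all parameter names are ours.

```python
import heapq
import itertools
import math


def gota(F, a, b, L, eps=1e-2, max_iter=20000):
    """Minimise F over the box a <= w <= b; return (w_best, F(w_best), splits)."""

    def evaluate(lo, hi, parent_beta):
        f_lo, f_hi = F(lo), F(hi)
        diam = math.dist(lo, hi)
        # Diagonal point, shifted towards the vertex with the lower value.
        shift = (f_lo - f_hi) / (2.0 * L * diam)
        shift = max(-0.45, min(0.45, shift))  # safeguard (our assumption)
        w = [0.5 * (l + u) + shift * (u - l) for l, u in zip(lo, hi)]
        samples = [(f_lo, lo), (f_hi, hi), (F(w), w)]
        alpha, w_alpha = min(samples, key=lambda s: s[0])           # upper bound
        beta = max(parent_beta, max(s[0] for s in samples) - L * diam)  # lower bound
        return alpha, w_alpha, beta, w

    tie = itertools.count()  # tie-breaker so the heap never compares lists
    lo0, hi0 = list(a), list(b)
    alpha, w_best, beta, w = evaluate(lo0, hi0, -math.inf)
    heap = [(beta, next(tie), lo0, hi0, w)]
    for n in range(max_iter):
        if alpha <= eps or not heap:
            return w_best, alpha, n
        beta_M, _, lo, hi, w = heapq.heappop(heap)  # smallest lower bound
        if beta_M >= alpha:                          # all remaining boxes are pruned
            return w_best, alpha, n
        k = max(range(len(lo)), key=lambda i: hi[i] - lo[i])  # longest edge
        left_hi, right_lo = list(hi), list(lo)
        left_hi[k] = w[k]
        right_lo[k] = w[k]
        for c_lo, c_hi in ((lo, left_hi), (right_lo, hi)):
            c_alpha, c_w_alpha, c_beta, c_w = evaluate(c_lo, c_hi, beta_M)
            if c_alpha < alpha:
                alpha, w_best = c_alpha, c_w_alpha
            if c_beta < alpha:                       # keep only boxes still of interest
                heapq.heappush(heap, (c_beta, next(tie), c_lo, c_hi, c_w))
    return w_best, alpha, max_iter


def toy_error(w):
    """Toy criterion with minimum value 0 at (0.3, -0.2); not a neural network."""
    return (w[0] - 0.3) ** 2 + (w[1] + 0.2) ** 2


# On [-1, 1]^2 the gradient norm of toy_error is below 4, so L = 4 is a valid constant.
w_best, f_best, splits = gota(toy_error, (-1.0, -1.0), (1.0, 1.0), L=4.0)
```

Storing the active boxes in a heap keyed by the lower bound makes the bound improving selection cheap, and a box whose lower bound is not below the current best value is simply never expanded.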


CHAPTER 8
IMPLEMENTATION  OF  GOTA 

We discuss the implementation issues of the global optimal training algorithm (GOTA) in the following. As with any other global optimization approach, the general procedure of GOTA would not be effective unless domain specific knowledge of the neural network is incorporated in the search procedure. Several issues are critical to the implementation of GOTA. These issues include generating partition elements (weight subsets), choosing a partition element for further investigation (branching), and finding lower and upper bounds of the criterion function over a partition element (bounding). The search space of an FNN is generally huge. Thus the success of GOTA depends largely on the pruning of unpromising subregions with tight lower bounds. This in turn depends on good estimation of the local Lipschitz constant over the subregions. Fortunately, as shown in the following, the estimation of the local Lipschitz constant is feasible and computationally efficient.

8.1     Compute  Local  Lipschitz  Constant 

For clarity of exposition, we will consider computing the Lipschitz constant of a three layer FNN with a single output unit. The extension of this case to a general FNN is straightforward as discussed in Section 7.4. Using equation 7.22, we can compute the Lipschitz constant of the criterion function with a given training pattern p by

$$L_o = \max \gamma f(1 - f)\; \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2} \sum_{j=1}^{h} \max \gamma f_j(1 - f_j)\Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2}$$

$$L_{F_p} = \max |t_p - o_p|\, L_o, \quad \forall w \in W.$$

Four maximization problems need to be solved over a given weight subset. Solving those problems may seem difficult, as the functions are nonlinear and nonconvex. However, by exploiting the properties of the sigmoid activation function and the special structure of the FNN, we can effectively solve those problems over a weight subset, if the weight subset is a hyper-rectangle in the weight space.

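Let $W_{PE} \subset W \subset R^s$ (PE is for partition element) be a hyper-rectangle over which $L_{F_p}$ is to be computed. Also, let $\bar{w}$ and $\underline{w}$ be the upper vertex and the lower vertex of $W_{PE}$, respectively. $\bar{w}$ and $\underline{w}$ are defined as

$$\bar{w} = \{\bar{w} \in W_{PE} \mid \bar{w}_i \ge w_i,\ i = 1, 2, \ldots, s,\ \forall w \in W_{PE}\}$$

$$\underline{w} = \{\underline{w} \in W_{PE} \mid \underline{w}_i \le w_i,\ i = 1, 2, \ldots, s,\ \forall w \in W_{PE}\} \tag{8.1}$$

Lemma 8.1 For a standard sigmoid function $f(x) = 1/(1 + e^{-x})$ over a finite interval $[a, b] \subset R$, the maximum of its gradient, or its Lipschitz constant $L_f$, is given by

$$L_f = \max f(1 - f) = \begin{cases} f'(a) & \text{if } a > 0 \\ f'(b) & \text{if } b < 0 \\ \frac{1}{4} & \text{if } a \le 0 \le b \end{cases} \tag{8.2}$$
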

Proof:

The standard sigmoid function f(x) is monotonically increasing. The gradient of f(x), $f'(x) = f(1 - f)$, is a quadratic function of f that achieves its maximum at x = 0. For any $x \in [a, b] \subset R$, there are three possible cases.

Case 1: $0 < a \le b$. $f'(x)$ is monotonically decreasing. Thus $\max f'(x) = f'(a)$, $\forall x \in [a, b]$.

Case 2: $a \le b < 0$. $f'(x)$ is monotonically increasing. Thus $\max f'(x) = f'(b)$, $\forall x \in [a, b]$.

Case 3: $a \le 0 \le b$. $\max f'(x) = \max\{f(1 - f) \mid f \in [0, 1]\} = \frac{1}{4}$.
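
Lemma 8.1 translates directly into a small function. The Python sketch below is our own illustration (the names are not from the GOTA source); it returns the largest slope of the standard sigmoid over an interval by checking only where the interval lies relative to zero.

```python
import math


def sigmoid(z):
    """Standard sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_slope(z):
    """Derivative of the standard sigmoid: f(z) * (1 - f(z))."""
    f = sigmoid(z)
    return f * (1.0 - f)


def max_sigmoid_slope(a, b):
    """Maximum of f'(z) over [a, b], following the three cases of Lemma 8.1."""
    if a > b:
        raise ValueError("expected a <= b")
    if a > 0:          # interval right of zero: slope decreases, maximum at a
        return sigmoid_slope(a)
    if b < 0:          # interval left of zero: slope increases, maximum at b
        return sigmoid_slope(b)
    return 0.25        # interval contains zero


# Brute-force check on one interval: no grid point exceeds the closed form.
grid_max = max(sigmoid_slope(1.0 + 2.0 * i / 1000) for i in range(1001))
```

No search over the interval is needed, which is what makes the local Lipschitz computation cheap.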


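Now let us consider the four maximization problems one at a time. First, the problem

$$P_1 = \sum_{j=1}^{h} \max \gamma f_j(1 - f_j)\Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2}.$$

For a given input pattern $x_p$, $(1 + \sum_{i=1}^{n} x_i^2)^{1/2}$ is a constant. Since the $f_j$, j = 1, 2, ..., h, are independent (cf. the structure of an FNN), the maximization problem can be solved independently for each j. By Lemma 8.1, maximizing $f_j(1 - f_j)$ is determined by the interval $[a_j, b_j]$, where

$$a_j = \min \gamma\Big(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\Big), \quad \forall w \in W_{PE}$$

$$b_j = \max \gamma\Big(\sum_{i=1}^{n} w_{ij} x_i + w_{0j}\Big), \quad \forall w \in W_{PE}. \tag{8.3}$$

Since the input $x \in [0, 1]^n$, we have

$$a_j = \gamma\Big(\sum_{i=1}^{n} \underline{w}_{ij} x_i + \underline{w}_{0j}\Big)$$

$$b_j = \gamma\Big(\sum_{i=1}^{n} \bar{w}_{ij} x_i + \bar{w}_{0j}\Big) \tag{8.4}$$

where $\bar{w}$ and $\underline{w}$ are the upper and lower vertices, respectively, of the hyper-rectangle $W_{PE}$. Using Lemma 8.1, $P_1$ is easily computed over $W_{PE}$:

$$\max f_j(1 - f_j) = \begin{cases} f_j'(a_j) & \text{if } a_j > 0 \\ f_j'(b_j) & \text{if } b_j < 0 \\ \frac{1}{4} & \text{if } a_j \le 0 \le b_j. \end{cases} \tag{8.5}$$

Let

$$P_2 = \max \Big(1 + \sum_{j=1}^{h} f_j^2(w, x)\Big)^{1/2}, \quad \forall w \in W_{PE}.$$

By the monotonicity of the sigmoid function,

$$P_2 = \Big(1 + \sum_{j=1}^{h} f_j^2(b_j)\Big)^{1/2}$$

where $b_j$ is given by equation 8.4.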
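For the third maximization problem

$$P_3 = \max f(w, x)(1 - f(w, x)), \quad \forall w \in W_{PE},$$

we have $f(w, x) = f(\sum_{j=1}^{h} w_j f_j + w_0)$, where $f_j$ is the output from hidden node j. We need to find

$$a = \min \gamma\Big(\sum_{j=1}^{h} w_j f_j + w_0\Big), \quad \forall w \in W_{PE}$$

$$b = \max \gamma\Big(\sum_{j=1}^{h} w_j f_j + w_0\Big), \quad \forall w \in W_{PE}. \tag{8.6}$$

Since $f_j$ is not a constant, in order to find the interval [a, b] for f, we need to partition the lower and upper vertices of the current hyper-rectangle into separate sets. Let J be the index set. We define

$$\bar{J}^{+} = \{j \in J \mid \bar{w}_j > 0\}$$

$$\bar{J}^{-} = \{j \in J \mid \bar{w}_j \le 0\}$$

$$\underline{J}^{+} = \{j \in J \mid \underline{w}_j > 0\}$$

$$\underline{J}^{-} = \{j \in J \mid \underline{w}_j \le 0\}. \tag{8.7}$$

Then the input interval for f can be computed by

$$a = \gamma\Big(\sum_{j \in \underline{J}^{+}} \underline{w}_j f_j(a_j) + \sum_{j \in \underline{J}^{-}} \underline{w}_j f_j(b_j) + \underline{w}_0\Big)$$

$$b = \gamma\Big(\sum_{j \in \bar{J}^{+}} \bar{w}_j f_j(b_j) + \sum_{j \in \bar{J}^{-}} \bar{w}_j f_j(a_j) + \bar{w}_0\Big). \tag{8.8}$$

Note that the above equations are based on the facts that $f_j \in [0, 1]$ and $f_j$ is monotonic. After the interval [a, b] is computed, $P_3$ is determined by

$$P_3 = \begin{cases} f'(a) & \text{if } a > 0 \\ f'(b) & \text{if } b < 0 \\ \frac{1}{4} & \text{if } a \le 0 \le b. \end{cases} \tag{8.9}$$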

For the fourth maximization problem

$$P_4 = \max |t_p - o_p(w, x)|, \quad \forall w \in W_{PE},$$

we notice that, if the target values are binary, we will have

$$P_4 = \begin{cases} t_p - f(a) & \text{if } t_p \text{ assumes the upper value} \\ f(b) - t_p & \text{if } t_p \text{ assumes the lower value.} \end{cases}$$

Even if the target value is not binary, computing $P_4$ would be easy, as the interval [a, b] has already been found in computing $P_3$, and only the end points of the interval need to be evaluated.
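
The four maximizations can be combined into one routine. The Python sketch below is a minimal illustration for a single-output network with one hidden layer; it is not the PartitionElement code of the BB package. The weight layout (bias first, then one weight per incoming connection) and all names are our assumptions, and the example box is chosen only to exercise the code.

```python
import math


def sigmoid(z):
    """Standard sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def max_slope(a, b):
    """Lemma 8.1: maximum of f(1 - f) for the standard sigmoid over [a, b]."""
    if a > 0:
        fa = sigmoid(a)
        return fa * (1.0 - fa)
    if b < 0:
        fb = sigmoid(b)
        return fb * (1.0 - fb)
    return 0.25


def local_lipschitz(x, t, hid_lo, hid_hi, out_lo, out_hi, gamma=1.0):
    """Bound L_Fp for one pattern (x, t) over a weight box.

    hid_lo[j], hid_hi[j]: lower/upper vertex of [w_0j, w_1j, ..., w_nj].
    out_lo, out_hi:       lower/upper vertex of [w_0, w_1, ..., w_h].
    Inputs x are assumed to lie in [0, 1]^n, as in the text.
    """
    x_norm = math.sqrt(1.0 + sum(xi * xi for xi in x))
    p1, f_lo, f_hi = 0.0, [], []
    for w_lo, w_hi in zip(hid_lo, hid_hi):
        # Equation 8.4: with x >= 0 the extremes sit at the two vertices.
        a_j = gamma * (w_lo[0] + sum(w * xi for w, xi in zip(w_lo[1:], x)))
        b_j = gamma * (w_hi[0] + sum(w * xi for w, xi in zip(w_hi[1:], x)))
        p1 += gamma * max_slope(a_j, b_j) * x_norm        # one term of P1
        f_lo.append(sigmoid(a_j))
        f_hi.append(sigmoid(b_j))
    p2 = math.sqrt(1.0 + sum(f * f for f in f_hi))        # P2
    # Equation 8.8: choose the hidden output that extremises each w_j * f_j.
    a = gamma * (out_lo[0] + sum(w * (fl if w > 0 else fh)
                                 for w, fl, fh in zip(out_lo[1:], f_lo, f_hi)))
    b = gamma * (out_hi[0] + sum(w * (fh if w > 0 else fl)
                                 for w, fl, fh in zip(out_hi[1:], f_lo, f_hi)))
    p3 = gamma * max_slope(a, b)                          # gamma times P3
    p4 = max(abs(t - sigmoid(a)), abs(t - sigmoid(b)))    # P4 from the end points
    return p4 * p3 * p2 * p1


xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Full box [-10, 10]^9 for a 2x2x1 network.
L_full = sum(local_lipschitz(x, t, [[-10.0] * 3] * 2, [[10.0] * 3] * 2,
                             [-10.0] * 3, [10.0] * 3) for x, t in xor)

# A sub-box where every hidden unit is saturated (illustrative choice).
L_small = sum(local_lipschitz(x, t, [[5.0] * 3] * 2, [[10.0] * 3] * 2,
                              [0.0] * 3, [10.0] * 3) for x, t in xor)
```

Over the full box the result is close to the global constant, while the saturated sub-box gives a value smaller by more than two orders of magnitude, which is the effect the pruning step relies on.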

The  above  derivation  has  shown  that  the  four  maximization  problems  can  be 
efficiently  solved.  The  important  feature  of  this  procedure  is  that  the  computation 
of  Lipschitz  constant  is  dynamically  carried  out  at  subsets  of  the  weight  space.  It 
is  now  a  well  known  fact  that  FNNs  generally  have  large  areas  of  plateau  where  the 
gradient  is  extremely  small  (cf.  Figure  5.1).  This  has  resulted  in  the  ineffectiveness 
of  gradient  based  search  methods  in  those  areas.  With  the  local  Lipschitz  constant 
procedure,  we  would  expect  to  find  very  small  Lipschitz  constants  over  those  areas. 
Small  Lipschitz  constants  result  in  tight  lower  bounds  (see  discussions  in  Section  7.2), 
thus  making  it  possible  to  detect  and  eliminate  the  unpromising  subregions  in  the 
weight  space,  and  hence  reducing  ineffective  search. 

Let us apply the local Lipschitz constant procedure to the 2x2x1 XOR network. Assuming $\gamma = 1$, a theoretical global Lipschitz constant can be computed by

$$L_F = \sum_{p=1}^{4} L_{F_p}$$

$$= \sum_{p=1}^{4} \max |t_p - o_p|\; \max \gamma f(1 - f)\; \max \Big(1 + \sum_{j=1}^{h} f_j^2\Big)^{1/2} \sum_{j=1}^{h} \max \gamma f_j(1 - f_j)\Big(1 + \sum_{i=1}^{n} x_i^2\Big)^{1/2}$$

$$= \frac{h}{16}\sqrt{1 + h}\,\big(1 + \sqrt{2} + \sqrt{2} + \sqrt{3}\big)$$

$$= \frac{\sqrt{3}}{8}\big(1 + 2\sqrt{2} + \sqrt{3}\big)$$

$$= 1.203878. \tag{8.10}$$

This is obtained by overestimating: taking $|t_p - o_p| = 1$, $f(1 - f) = 1/4$, and $\sqrt{1 + \sum_{j=1}^{h} f_j^2} = \sqrt{1 + h}$. By actually maximizing those terms over a given weight subset, we may get much smaller local Lipschitz constants than the global one for each partition element. Table 8.1 shows that the local Lipschitz constants vary significantly over different weight subregions. These subregions are hyper-rectangles identified by the lower and upper vertices.
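
The arithmetic in equation 8.10 can be checked with a few lines of Python. The sketch below applies the same three overestimates to each XOR input pattern and sums the results; the function name is ours.

```python
import math


def global_lipschitz_for_patterns(inputs, h, gamma=1.0):
    """Sum over patterns of the worst-case per-pattern constant L_Fp."""
    total = 0.0
    for x in inputs:
        err = 1.0                            # |t_p - o_p| <= 1
        out_slope = gamma / 4.0              # gamma * f(1 - f) <= gamma / 4
        hid_norm = math.sqrt(1.0 + h)        # (1 + sum_j f_j^2)^(1/2) <= sqrt(1 + h)
        x_norm = math.sqrt(1.0 + sum(xi * xi for xi in x))
        hid_sum = h * (gamma / 4.0) * x_norm # sum over the h hidden units
        total += err * out_slope * hid_norm * hid_sum
    return total


xor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
L_xor = global_lipschitz_for_patterns(xor_inputs, h=2)
```

With h = 2 and $\gamma = 1$ this agrees with the closed form $\frac{\sqrt{3}}{8}(1 + 2\sqrt{2} + \sqrt{3})$.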
Table 8.1. Lipschitz Constant over Weight Subsets

Weight Hyper-rectangle (LV = lower vertex, UV = upper vertex)    Lipschitz Constant

LV=(-10 -10 -10 -10 -10 -10 -10 -10 -10)
UV=( 10  10  10  10  10  10  10  10  10)                         1.20388

LV=(  0   0   0   0   0   0 -10 -10 -10)
UV=( 10  10  10  10  10  10  10  10  10)                         0.89769

LV=(  5   0   0   0   0   0   0   0   0)
UV=( 10  10  10  10  10  10  10  10  10)                         0.01584

LV=(  5   0   0   5   0   0   0   0   0)
UV=( 10   5   5  10  10  10  10  10  10)                         0.00793

LV=(  0   5   5   5   5   0   5   5   0)
UV=(  5  10  10  10  10   5  10  10   5)                         0.00792

LV=(  0   5   0   0   0   0   0   0   0)
UV=(  5  10   5  10  10  10  10  10  10)                         0.17889

LV=(  0   5   0   5   0   0   0   0   0)
UV=(  5  10   5  10   5  10  10  10  10)                         0.01167

LV=(  0   0   0   0   0   0  -5  -5  -5)
UV=(  5   5   5   5   5   5   5   5   5)                         0.89769

LV=(2.5 2.5   0   0   0   0   0   0   0)
UV=(  5   5   5   5   5   5   5   5   5)                         0.05438

LV=(2.5 2.5 2.5 2.5   0   0   0   0   0)
UV=(  5   5   5   5   5   5   5   5   5)                         0.00880

LV=( -5  -5  -5  -5  -5  -5  -5  -5  -5)
UV=(  0   0   0   0   0   0   0   0   0)                         0.74146

8.2     Program Design

Although there are many commercial and public domain neural network simulators available for a wide range of neural network paradigms, none of those packages provides the facility for implementation of our branch-and-bound based global optimal training algorithm (GOTA). Hence we have developed two neural network simulation packages (in C++): one for general feedforward neural networks with backpropagation based learning algorithms (FNET), and the other for branch-and-bound based learning algorithms (BB). We discuss briefly the program design and implementation issues in the following.

8.2.1     Object-oriented  Program  Structures 

For  the  feedforward  neural  network  simulator,  the  building  blocks  are  the  classes 
for  the  nodes  (neurons).  Three  classes  of  nodes,  namely,  Input,  Hidden,  and  Output, 
are  designed.  Hidden  nodes  and  output  nodes  are  the  processing  units.  They  contain 
the  neuron  activation  functions  and  connection  information,  as  well  as  the  weights  on 
the  connections.  Class  Network  is  derived  from  the  node  classes.  The  network  class 
has  methods  for  various  backpropagation  based  training.  Those  methods  include 
epoch  training,  sequential  pattern  training,  randomized  pattern  training,  and  quick 
propagation. 

The branch-and-bound simulator has three basic classes. Class PartitionElement is used to hold data for the partition elements in the branch-and-bound procedure. The data include the weight subsets, upper and lower bounds of the criterion function over the weight subsets, etc. PartitionElement also provides methods for computing local Lipschitz constants and finding local upper and lower bounds. Class PList is a linked list of partition elements. This class also provides methods for manipulating the linked list through adding, deleting, inserting, and appending operations, by which different branching strategies can be implemented. Finally, the class BB implements the branch and bound algorithm with methods that manipulate the PartitionElement and PList objects.

The FNET and BB programs are combined to create a program that implements the global optimal training algorithm (GOTA). A Network object net is defined which specifies the neural network topology and functionality. Each partition element in the branch-and-bound procedure has access to net. Hence local bounds can be evaluated through the network object. Global convergence of GOTA depends on the search strategies, which are discussed in the next subsection. Source code (in C++) of GOTA is listed in Appendix A.

A general neural network simulation system (NNET) is also developed. This program consists of modules made of basic classes, such as Link, Node, and Structure. A generic neural net class, NeuralNet, is constructed using the basic classes. Other specific neural net classes are derived from NeuralNet. Some major neural net subclasses, such as the feedforward neural network, can be used as parent classes from which more algorithm-specific neural net classes inherit the structure and/or methods (functions). A separate class Interface is designed to provide run-time control and access to the neural net parameters on-line and off-line. The class definitions are presented in Appendix B.

8.2.2     Search  Strategies 

Four search strategies may be implemented in the general branch and bound procedure. The four strategies, namely, best first, depth first, breadth first, and bound improving, differ in the way a partition element is chosen for further partitioning.

Best  first  search  chooses,  among  all  currently  active  partition  elements,  the  one 
with  the  best  current  solution  (upper  bound).  Depending  on  the  error  surface  in 
the  weight  space,  this  method  may  or  may  not  be  effective.  One  problem  with  it 
is  that  it  may  miss  a  subregion  containing  a  global  optimal  solution,  but  having  a 
relatively  large  upper  bound.  Thus  this  strategy  would  be  more  effective  when  the 
search  regions  are  relatively  small.  Global  convergence  is  not  guaranteed  with  this 
approach. 

Depth first search focuses the search within the current subregion. It keeps cutting the current partition element into smaller and smaller pieces. This method is generally not effective unless good pruning methods are used in conjunction with it. This search method would be more useful for problems where relatively many global optimal solutions exist. Depth first search is not globally convergent without effective pruning methods.

Breadth first search chooses the largest partition element for further partitioning. This method provides a globally convergent procedure, even without using local information from partition elements. The branching operation is complete. This is true since the search is exhaustive, and there is no unexplored feasible region in the limit. With a consistent bounding operation, a global optimal solution can be achieved in the limit (cf. Theorem 7.1). For the feedforward neural network training problem, a natural lower bound exists. Thus if we provide an error tolerance $\epsilon$, we will have a finitely convergent algorithm.

Best first and depth first search methods can also be made globally convergent with effective pruning procedures. In this case, the branching operation in the branch-and-bound algorithm would become complete.

The bound improving search strategy chooses, among all active partition elements, the one with the smallest lower bound. Thus the global lower bound is nondecreasing. With a consistent bounding operation, Theorem 7.2 ensures this procedure is globally convergent. When the difference between the current global upper bound and the current local lower bound is small enough in a subregion, that subregion can be deleted from further consideration. Thus the success of this procedure depends largely on the effectiveness of computing upper and lower bounds over the weight subsets.
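
The four strategies differ only in the key used to pick the next partition element, which the following Python sketch makes explicit. The `Element` class is a hypothetical stand-in, not the PartitionElement class of the BB package, and two readings are our assumptions: "largest" is measured by the length of the box diagonal, and depth first is modelled as picking the most deeply nested element.

```python
import math
from dataclasses import dataclass


@dataclass
class Element:
    lower: tuple      # lower vertex of the weight box
    upper: tuple      # upper vertex of the weight box
    alpha: float      # upper bound (best criterion value found in the box)
    beta: float       # lower bound on the criterion over the box
    depth: int = 0    # number of splits that produced this box


def diameter(e):
    """Length of the box diagonal, used here as the size of an element."""
    return math.dist(e.lower, e.upper)


def select(active, strategy):
    """Return the element that the given strategy would partition next."""
    if strategy == "best_first":
        return min(active, key=lambda e: e.alpha)   # best current solution
    if strategy == "depth_first":
        return max(active, key=lambda e: e.depth)   # keep cutting the newest box
    if strategy == "breadth_first":
        return max(active, key=diameter)            # largest element
    if strategy == "bound_improving":
        return min(active, key=lambda e: e.beta)    # smallest lower bound
    raise ValueError(f"unknown strategy: {strategy}")


active = [
    Element((0.0, 0.0), (4.0, 4.0), alpha=0.9, beta=-3.0, depth=0),
    Element((0.0, 0.0), (1.0, 1.0), alpha=0.2, beta=-0.5, depth=2),
    Element((1.0, 0.0), (2.0, 1.0), alpha=0.5, beta=-0.1, depth=3),
]
```

In this small example best first picks the second element, depth first the third, and both breadth first and bound improving pick the first.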

8.2.3     Lower  and  Upper  Bounding 

In Section 8.1 we have discussed computing the Lipschitz constant for a feedforward neural network. Based on the local Lipschitz constant, lower bounds over the subregion can be easily computed as shown in Section 7.2. There are several ways to find the upper bounds over the partition elements. With Piyavskii's algorithm applied to the diagonal line of the weight hyper-rectangle, an upper bound is taken as the minimum of the function evaluated at the lower and upper vertices and an interior point from which the current hyper-rectangle will be partitioned later. The interior point can be either the middle point on the diagonal line or a point determined by Piyavskii's algorithm that is biased towards the upper or lower vertex with a lower function value.

One of the drawbacks of the univariate Piyavskii's algorithm extended to higher dimensions is that searching for an upper bound on the diagonal line can miss a lot of promising regions. To explore wider areas in the partition element, we may evaluate the criterion function at several random points, and then subdivide the partition element at the point that gives the tightest upper bound.

The effectiveness of global optimization approaches can generally be improved with an efficient local search procedure, when one is applicable. A variety of local search methods may be employed under the general branch-and-bound framework to find local upper bounds. Since a feedforward neural network is equivalent to a continuously differentiable mapping, it is natural to consider using gradient based local search methods, such as classic backpropagation and its various extensions. How to combine the branch-and-bound procedure (BB) with backpropagation (BP) local search and obtain a globally convergent algorithm is discussed next.

8.3     Combined  BB  and  BP 

Since global optimization problems are generally much more difficult to solve than local optimization problems, there is hardly any efficient global optimization algorithm. Our branch-and-bound based neural network training algorithm (GOTA) uses implicit exhaustive search in order to obtain a guaranteed global optimal solution. Even with the local Lipschitz constant based pruning procedure, GOTA is still not an efficient algorithm (see experimental results in the next section). The problem is especially acute for large neural networks, as the number of partition elements grows exponentially with the dimension of the weight space. On the other hand, for BP based local search, the computational effort grows polynomially with the size of the network (Scalero, 1989). However, BP based training algorithms have many shortcomings, as discussed in Chapter 2. Among those shortcomings is the lack of a globally convergent property.

Taking advantage of both GOTA and the BP algorithm by incorporating BP as a local search procedure of GOTA would yield a globally convergent procedure with improved efficiency. However, keeping track of the BP search procedure is very difficult, as there are no binding constraints that retain the BP solution within the current partition element.¹

Instead  of  considering  the  partition  element  where  the  local  search  procedure 
stops,  let  us  consider  the  partition  element  where  the  BP  starts.  If  from  the  current 
partition  element  the  BP  procedure  leads  to  a  global  optimal  solution,  then  the 
GOTA  procedure  halts.  If  the  BP  procedure  ends  up  with  a  local  minimum,  the 
current  partition  element  is  put  at  the  bottom  of  the  partition  element  list.  Other 
partition  elements  will  be  searched.  If  there  is  no  other  partition  element,  then 
the  current  partition  element  would  be  partitioned.  Local  search  is  applied  to  the 
newly  created  subregions.  This  is  essentially  a  breadth  first  global  search  combined 
with  local  BP.  Per  discussion  in  the  last  section,  the  local  search  augmented  GOTA 
(LGOTA)  is  globally  convergent. 

There  are  two  important  issues  in  the  implementation  of  LGOTA.  One  is  when  to 
invoke  the  local  search  procedure,  and  the  other  is  how  to  identify  a  local  minimum. 
Local search can be initiated at each partition element. This strategy would not be efficient for problems with numerous local minima. As a general guideline, Torn and Zilinskas (1989) suggested starting local search when a random sampling point in a subregion yields a solution better than the current global upper bound. For
feedforward  neural  network  training,  a  global  error  threshold  may  be  provided.  Local 
search  would  be  invoked  when  the  current  error  is  less  than  the  global  threshold.  This 
threshold  can  range  from  0  (no  local  search  is  performed)  to  the  maximum  possible 
global  error  (local  search  is  always  performed). 

Identifying a local minimum solution for the neural network training problem is
relatively easy because the global minimum value of the criterion function is known.
Any weight point whose function value is larger than the global minimum value
(possibly plus an error tolerance) can be identified as a local minimum if the gradient
there is zero. Theoretically, the point may also be an inflection point. Practically,
whenever the gradient is too small (when BP ceases to be effective), the current
partition element can be dropped to the bottom of the search list.

Footnote: In fact, we may not want to confine the BP procedure to the subregion,
because it may well lead to a global optimal solution in some neighboring subregion.

Table 8.2. GOTA Iterations for Solving the XOR Problem

errorThresh    γ = 1.0    γ = 2.0       γ = 4.0
0.0            10350      8115          12556
0.1            10317      8111          12556
0.2            7557       (illegible)   12556
0.4            1917       2847          3669
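The practical local-minimum test described in the text (error still above the known global minimum while the gradient has nearly vanished) can be sketched as follows. The names and the tolerance parameters errorTol and gradTol are assumptions made for illustration; the text does not specify numerical tolerances.

```cpp
#include <cmath>
#include <vector>

// Euclidean norm of a gradient vector.
double gradientNorm(const std::vector<double>& grad) {
    double sum = 0.0;
    for (double g : grad) sum += g * g;
    return std::sqrt(sum);
}

// True when local search appears stuck: the error is still above the known
// global minimum (plus a tolerance) but the gradient is too small for BP
// to make progress. In that case the partition element would be moved to
// the bottom of the search list.
bool stuckAtLocalMinimum(const std::vector<double>& grad, double error,
                         double globalMinError, double errorTol,
                         double gradTol) {
    return error > globalMinError + errorTol && gradientNorm(grad) < gradTol;
}
```

For a sum of squared error criterion with a realizable target, globalMinError would be zero, which is what makes this test cheap for neural network training.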

8.4     Experiments with GOTA and LGOTA

For  historical  reasons,  the  XOR  problem  is  widely  used  as  a  benchmark  problem 
for  testing  neural  network  training  algorithms.  Two  standard  network  structures  have 
been  used  for  solving  the  XOR  problem.  One  has  a  2  x  2  x  1  layered  structure  and 
the  other  has  a  2  x  1  x  1  structure  with  direct  connections  from  the  input  units  to  the 
output  unit.  We  use  the  former  to  test  GOTA  and  LGOTA  as  it  is  the  more  difficult 
one.  In  the  following,  all  starting  weight  sets  are  a  hyper-rectangle  with  lower  vertex 
being  (-10,  -10,  •  •  •  ,  -10)  and  upper  vertex  being  (10, 10,  •  •  •  ,  10)  unless  explicitly 
stated  otherwise. 

8.4.1     GOTA  with  Different  Error  Thresholds 

We  first  test  GOTA  based  on  Piyavskii  lower  bounding  and  upper  bounding.  The 
stopping  criteria  are:  (1)  the  total  sum  of  squared  (TSS)  error  less  than  0.04,  or  (2)  the 
total  number  of  iterations  exceeds  the  maximum  allowed.  In  the  following  tables  the 
integer  numbers  are  the  iterations  at  which  the  TSS  error  drops  below  the  stopping 
error.  If  a  real  number  appears  in  the  table  where  an  iteration  number  should  be, 
that  real  number  indicates  the  TSS  error  when  the  total  number  of  iterations  exceeds 
the  specified  maximum  (10000).  Note  that,  in  general,  either  one  of  the  stopping 
criteria  terminates  the  algorithm,  but  not  both. 
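The two stopping criteria can be sketched as a single check. The enum and function names below are hypothetical; the default values 0.04 and 10000 are the ones quoted in the text.

```cpp
// Outcome of the stopping test after one GOTA iteration.
enum class StopReason { None, ErrorReached, IterationLimit };

// Criterion (1): the TSS error has dropped below the stopping error.
// Criterion (2): the iteration count exceeds the maximum allowed.
StopReason checkStop(double tssError, long iteration,
                     double stopError = 0.04, long maxIteration = 10000) {
    if (tssError < stopError) return StopReason::ErrorReached;
    if (iteration > maxIteration) return StopReason::IterationLimit;
    return StopReason::None;
}
```

A table entry would then be the iteration count when ErrorReached is returned, and the current TSS error when IterationLimit is returned.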

The lower bound improving search strategy is used in Table 8.2. The parameter
errorThresh determines when an output value is considered to be correct. When the
difference between the output value and the target value is less than errorThresh,
the output is considered correct. As expected, the number of iterations decreases
when the error threshold increases. The parameter γ determines the slope of the
sigmoid function. GOTA has better average performance with γ = 2.0.
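The roles of the two parameters can be illustrated with a short sketch. It assumes the common logistic form with range (0, 1), in which the gain γ multiplies the net input; the function names are hypothetical.

```cpp
#include <cmath>

// Logistic sigmoid with gain factor: a larger gain gives a steeper slope
// around net = 0 (the slope at the origin is gain / 4).
double sigmoid(double net, double gain) {
    return 1.0 / (1.0 + std::exp(-gain * net));
}

// An output counts as correct when it is within errorThresh of its target.
bool outputCorrect(double output, double target, double errorThresh) {
    return std::fabs(output - target) < errorThresh;
}
```

With errorThresh = 0.4, for example, an output of 0.7 for a target of 1.0 is accepted, whereas with errorThresh = 0.1 it is not.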

Results in Table 8.2 show that, compared with the backpropagation algorithm,
GOTA is not efficient. On average, the BP algorithm takes a few hundred iterations
to train the XOR network. However, BP gets stuck in a local minimum
10-25% of the time when random initial weights are used. Once a solution falls into
a local minimum, the BP algorithm simply fails to solve the problem no matter how
many more iterations it runs. GOTA is a globally convergent algorithm. Its efficiency
can be improved with better pruning methods and/or local search procedures.

8.4.2     GOTA  with  Heuristic  Pruning 

We  discussed  different  search  strategies  in  Section  8.2.  For  the  XOR  problem, 
the  bound  improving  search  strategy  seems  the  most  effective.  Theoretically,  all 
the  search  strategies  with  GOTA  converge  to  a  global  optimal  solution  in  the  limit. 
However,  the  best  first  search  and  depth  first  search  are  much  less  effective.  They 
fail  to  reduce  the  TSS  error  to  the  acceptable  level  within  10000  iterations. 

We consider a heuristic pruning method. Since the Lipschitz constant is an
estimate of the maximum of the gradient norm, we would expect weight subsets
with a small Lipschitz constant to have little chance of containing a global optimal
solution, at least when the weight subsets are relatively large. The fact that the optimal
weights of an FNN are found in deep valleys of the error surface also supports this
assertion. Thus we may delete weight subsets with a small local Lipschitz constant.
This indeed improves the speed of GOTA. Table 8.3 shows the number of iterations
required to learn the XOR problem for different search strategies. Again, a real
number indicates the TSS error when the maximum number of iterations is reached. The
error threshold used in Table 8.3 is 0.4.
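The heuristic can be sketched as a filter over the active list of weight subsets. The Region struct and the function name are hypothetical, and lipThresh stands for the threshold parameter of Table 8.3; this is an illustration rather than the code used for the experiments.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal stand-in for a partition element: only the fields needed here.
struct Region {
    double lipConstant;  // local Lipschitz constant over the weight subset
    double lowerBound;   // lower bound of the criterion over the subset
};

// Remove every region whose local Lipschitz constant is below lipThresh.
// Returns the number of regions pruned. With lipThresh = 0 nothing is
// removed; too large a threshold can empty the list.
std::size_t pruneByLipschitz(std::vector<Region>& active, double lipThresh) {
    const std::size_t before = active.size();
    active.erase(std::remove_if(active.begin(), active.end(),
                                [lipThresh](const Region& r) {
                                    return r.lipConstant < lipThresh;
                                }),
                 active.end());
    return before - active.size();
}
```

Because the rule is heuristic, a caller would need to treat an empty list after pruning as a failure to find a solution rather than as proof that none exists.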

The heuristic pruning method requires knowledge of the Lipschitz constant of a
given FNN. This can be obtained from the first iteration of GOTA. However, there
is no established rule for finding the best Lipschitz constant threshold (lipThresh), which
determines when a partition element should be deleted. The larger the threshold,
the more weight subsets will be pruned. Using too large a threshold runs the
risk of also deleting the partition elements that contain global optimal solutions, hence
exhausting the partition element search list without finding an optimal solution.

Table 8.3. GOTA with Heuristic Pruning

lipThresh    Best first    Bound Improve    Depth first    Breadth first
0.0          1.0           1917             0.99           3477
0.8          1.0           1917             0.73           2931
1.0          1.0           1846             0.54           2578
1.1          1.0           1732             0.45           2252
1.2          1.0           1274             1187           1110
1.203        4151          1187             fail*          1019

* The PE list was exhausted.

Table 8.4. GOTA with Local Random Search

Global search    mean    std. dev.    min    max     errorThresh
Bound Improve    1725    1180         208    4526    0.1
                 718     543          16     1460    0.4
Breadth first    1470    1381         181    4912    0.1
                 719     725          86     3164    0.4

8.4.3     GOTA  with  Random  Local  Search 

Finding  an  upper  bound  over  a  partition  element  may  use  random  search,  which 
increases  the  search  scope  as  compared  with  the  Piyavskii  algorithm  where  only  the 
bisection  point  and  the  upper  and  lower  vertices  are  evaluated.  Table  8.4  lists  the 
statistics  resulting  from  twenty  runs  of  GOTA  with  local  random  search.  Each  local 
search  evaluates  four  uniformly  distributed  random  points  in  the  partition  element. 
Compared  with  Piyavskii  upper  bounding  (Table  8.2),  we  see  that  local  random 
search,  on  the  average,  increases  the  training  speed  of  GOTA. 
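The random upper bounding step can be sketched as follows: draw a few uniformly distributed weight vectors inside the hyper-rectangle and keep the best one. The names Sample and randomUpperBound are hypothetical, and the criterion function is passed in as a generic callable; the default of four points matches the setting reported above.

```cpp
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// A feasible weight vector together with its criterion value.
struct Sample {
    std::vector<double> point;
    double value;
};

// Evaluate numPoints (assumed >= 1) uniform random points in the box
// [lower, upper] and return the one with the smallest criterion value,
// which serves as an upper bound over the partition element.
Sample randomUpperBound(
    const std::vector<double>& lower, const std::vector<double>& upper,
    const std::function<double(const std::vector<double>&)>& f,
    std::mt19937& rng, int numPoints = 4) {
    Sample best{{}, 0.0};
    for (int k = 0; k < numPoints; ++k) {
        std::vector<double> w(lower.size());
        for (std::size_t i = 0; i < w.size(); ++i) {
            std::uniform_real_distribution<double> dist(lower[i], upper[i]);
            w[i] = dist(rng);
        }
        const double v = f(w);
        if (k == 0 || v < best.value) best = Sample{w, v};
    }
    return best;
}
```

Passing the random engine by reference keeps the runs reproducible when a fixed seed is used, which is convenient when collecting statistics over repeated runs.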


Table 8.5. LGOTA vs BP with Different γ

             LGOTA               Backprop
Parameter    mean    std. dev.   mean    std. dev.   BP fail rate
γ = 1.0      212     201         208     98          10%
γ = 5.0      122     96          89      92          10%
γ = 10.0     94      95          72      67          25%


8.4.4     GOTA  with  BP  Local  Search 
Gradient based search, when available, usually increases the efficiency of global
optimization algorithms. We incorporate backpropagation as a local search subroutine
in our global procedure to yield a local search augmented global optimal training
algorithm (LGOTA). A series of experiments was conducted to test the performance
of LGOTA in comparison with the backpropagation algorithm. In the following tables
the data are averages of twenty experimental runs. The average for the BP algorithm
is taken only over those runs that succeeded in finding a global minimum error. The initial
weights are random numbers generated uniformly in (-1, 1).

First, we tested LGOTA and BP with different values of the gain factor γ. Table 8.5 shows that
as γ increases, both LGOTA and BP learn faster. The performance
of LGOTA is not as good as that of the successful BP runs, but fairly close. The important fact
is that LGOTA always finds a global optimal solution, while BP has a 10 to
25 percent chance of failure.

The learning rate η plays an important role in the BP algorithm. The BP failure rate
increases both when η is too small and when η is too large (recall that BP is not
pure gradient descent, as a momentum term is used). Generally, as the momentum
α increases, the learning speed of BP increases. Table 8.6 and Table 8.7 show patterns
similar to Table 8.5. That is, LGOTA is comparable to BP, but maintains global
convergence.

Table 8.6. LGOTA vs BP with Different η

             LGOTA               Backprop
Parameter    mean    std. dev.   mean    std. dev.   BP fail rate
η = 0.1      895     675         584     96          15%
η = 0.5      212     201         208     98          10%
η = 1.0      124     67          107     42          10%
η = 2.0      77      47          66      37          10%
η = 5.0      175     147         142     186         25%

Table 8.7. LGOTA vs BP with Different α

             LGOTA               Backprop
Parameter    mean    std. dev.   mean    std. dev.   BP fail rate
α = 0.0      1140    735         1123    624         10%
α = 0.5      729     607         554     184         10%
α = 0.9      212     201         208     98          10%

We further tested LGOTA with the parity-3 problem. In this case, we are interested
in the condition to start BP local search. We use the parameter improveThresh.
When the difference between the current error and the mean of the errors for the last three
local iterations is less than the improvement threshold, the local search is stopped,
and the global procedure resumes. Note that the pure BP algorithm corresponds to the
case where improveThresh is zero. In Table 8.8 both the global iterations and the total local
iterations are reported. The first part of each entry is the number of global iterations.
Zero global iterations means that the first local search succeeded in obtaining a global
optimal solution. The experimental results show that as the improvement threshold
increases, the average number of local iterations decreases while the number of global
iterations increases. The important thing to notice is that the total effort of reaching
a solution is reduced with global search.
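The improvement test that ends a local search can be sketched as a small monitor over the most recent errors. The class name is hypothetical, and the use of the absolute difference is an assumption chosen so that improveThresh = 0 never stops the local search, matching the remark that pure BP corresponds to a zero threshold.

```cpp
#include <cmath>
#include <deque>

// Tracks the errors of the last three local (BP) iterations and signals
// when the improvement has fallen below improveThresh.
class ImprovementMonitor {
    std::deque<double> recent_;
    double improveThresh_;

public:
    explicit ImprovementMonitor(double improveThresh)
        : improveThresh_(improveThresh) {}

    // Returns true when local search should stop and the global procedure
    // should resume. No decision is made until three errors are recorded.
    bool shouldStop(double currentError) {
        bool stop = false;
        if (recent_.size() == 3) {
            const double mean = (recent_[0] + recent_[1] + recent_[2]) / 3.0;
            stop = std::fabs(mean - currentError) < improveThresh_;
            recent_.pop_front();
        }
        recent_.push_back(currentError);
        return stop;
    }
};
```

A larger improveThresh hands control back to the global search earlier, which is consistent with the trend in Table 8.8 of fewer local iterations and more global iterations.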
Table 8.8. LGOTA Iterations for Parity-3 Problem

improveThresh    0.0001    0.0000001    0.00000000001
run 1            2-381     0-282        0-187
run 2            2-863     1-2263       0-261
run 3            1-219     0-72         0-160
run 4            2-680     0-726        1-98705
run 5            5-905     1-4941       1-23244
run 6            1-254     1-29906      0-157
run 7            2-483     0-103        0-339
run 8            4-929     0-3127       0-134
run 9            1-119     0-41         1-217355
run 10           2-458     1-75528      0-623
mean             473       11699        34117


CHAPTER  9 
SUMMARY  AND  CONCLUSIONS 

In this dissertation we have conducted an extensive literature survey of the current
research on feedforward neural networks. This survey identifies a series of deficiencies
of the most popular neural net learning algorithm, the backpropagation procedure.
Although the BP algorithm has been successfully applied to many real world problems,
its shortcomings become less tolerable as the demand for neural net learning
speed and solution quality increases. We have addressed a range of problems associated
with backpropagation learning, with a focus on developing fast and globally
convergent neural net training algorithms.

9.1     Contributions 

We  proposed  a  globally  guided  variation  of  the  classic  backpropagation  learning 
algorithm.  Our  globally  guided  backpropagation  (GGBP)  considers  optimization  of 
the  global  criterion  function  in  the  output  space,  rather  than  in  the  conventional 
weight space. The result of this procedure is faster learning and convergence to a
global  optimal  solution.  The  new  method  also  requires  less  user  input  as  the  number 
of  user  defined  parameters  is  reduced.  GGBP  is  shown  to  be  equivalent  to  standard 
BP  with  a  dynamic  learning  rate. 

Both  stochastic  and  deterministic  global  optimization  approaches  are  employed 
for  neural  network  training.  A  few  published  reports  have  addressed  neural  net 
training  by  genetic  algorithm,  simulated  annealing,  and  pure  random  searches.  We 
approach  the  problem  with  new  ideas  and  heuristics  incorporated  in  the  stochastic 
algorithms. 

We have pioneered applying deterministic global optimization methods to neural
net training. The methods we used, namely, branch and bound based Lipschitz
optimization, are shown to be convergent. Hence global optimal solutions are guaranteed
by these procedures. We have shown that the criterion function of a feedforward
neural network is Lipschitzian. Furthermore, we developed efficient procedures for
computing Lipschitz constants over subsets of the weight space. The availability of
local Lipschitz constants makes effective pruning in the branch and bound procedure
possible.

Local  search  procedures,  such  as  the  gradient  based  backpropagation,  can  be 
incorporated  in  our  global  optimal  training  algorithm  (GOTA).  The  local  search 
augmented  global  algorithms  (LGOTA)  improve  the  learning  efficiency  of  GOTA, 
while  retaining  the  globally  convergent  property. 

9.2     Further  Research 

We have developed neural network training algorithms that produce global optimal
solutions. However, at this stage those algorithms apply only to neural networks
with static structures. We discussed in Chapter 4 that for different problems the
optimal neural network structures are different. Extending the GOTA approach to
finding the optimal neural network structure would be a significant research topic.
The general framework of GOTA allows different neural network structures to be
used as different main branches in the search tree. Adding or deleting processing
units and/or connections can also be implemented. This would result in a procedure
that dynamically fits a problem with the best network structure and the best weight
set.

Multilayered perceptrons are able to represent a broad class of problems. However,
because of the lack of effective training procedures, perceptrons have been used only
for linearly separable problems. GOTA can be applied to perceptron training because
the existence of the gradient of the criterion function is not required. This fact may be
more important than it appears, as Poggio and Girosi (1990) pointed out that the
activation function in the Kolmogorov network (cf. Chapter 3) may be continuous,
but not smooth. Such activation functions make the use of gradient based
training algorithms infeasible, let alone finding a global optimal solution.

Functional link neural networks are a generalization of standard neural networks.
However, choosing the functions is still based on trial and error. Increasing the types of
activation functions within a neuron has also been proposed. Mani (1990) has tried
to use a generalized gradient descent approach in searching the functional space.
Unfortunately, the order of functionals is not easily defined. The GOTA approach
can be readily extended to functional search.

Instead of finding a global optimal solution over the training set, the value of the
training algorithm will significantly increase if we seek to optimize the performance
of the neural network over the whole instance domain. Extending GOTA to train a
feedforward neural network aimed at generalization is another research topic worth
pursuing.
//
// Header file for FNET.cc and BB.cc (gota.h)
//
// Branch and Bound (BB) based Neural Network Training Program
//
// by
//       Zaiyong Tang
//       351 Business Building
//       Dept. of Decision and Information Sciences
//       College of Business Administration
//       University of Florida
//       Gainesville, Florida 32611
//
//       Phone: 904/392-9600 (O)
//              904/334-5430 (H)
//       Internet: ZT@cis.ufl.edu
//                 Tang@math.ufl.edu
//
// Current version 1.0 911228 z.t.
//

#ifndef LIP_H
#define LIP_H

#define GNU_CPP                 //Choose one of the systems !!!
//#define TURBO_CPP             //g++ or Turbo C++
//#define DISPLAY_YES           //For Turbo C++

#ifdef GNU_CPP
#define instream istream        //for GNU-C++ (g++)
#define outstream ostream
#else
#define instream ifstream       //for Turbo C++
#define outstream ofstream
#endif

#ifdef TURBO_CPP
#include "fstream.h"
#else
#include "stream.h"
#endif
#include "stdlib.h"
#include "stdio.h"
#include "float.h"
#include "math.h"
#include "string.h"
#include "time.h"

#define real double             //may change the precision as needed
EXT enum bool {false, true};    //Define boolean variable true and false

EXT enum TrainMethod {
    SAMPLE,                     //sample or Epoch training
    RANINSTANCE,                //randomized instance or pattern training
    INSTANCE,                   //sequential instance or pattern training
    QUICKPROP                   //quickpropagation training
};

EXT enum UnitType {
    SIGMOID,                    //sigmoid in range of [-0.5, 0.5]
    ASYMSIGMOID,                //sigmoid in range of [0.0, 1.0]
    GAUSSIAN                    //Gaussian unit
};

EXT enum ErrorFun {
    SumOfSquare,                //Sum of squared error criterion function
    HyperError                  //atanh criterion function
};

EXT enum search {               //search strategies in the branch and bound
    BEST_FIRST,                 //procedure
    LOWER_CONV,
    DEPTH_FIRST,
    BREATH_FIRST
};

EXT enum findUB {               //methods for finding upperbounds over
    RANDOM,                     //partition elements
    VERTICES,
    BACKPROP,
};


////////////////////////////////////////
//  Class of input neuron             //
////////////////////////////////////////

class Input
{
protected:
    int    numOut;
    real   out;
    real   in;
public:
    //Constructor to initialize the input node
    Input( void )  { numOut = 0; out = 0.; }
    void input( real x )    { in = out = x; }
    void incNumout(void)    { numOut++; }
    real getout(void)       { return out; }

    //The following virtual functions are used as terminals for
    //recursive function calls
    virtual real output()             {return out;}
    virtual void getDeltaW(void)      {return;}
    virtual void bp(real, real)       {return;}
    virtual void sampleUpdate(void)   {return;}
    virtual void insUpdate(void)      {return;}
    virtual void qpUpdate(void)       {return;}
    virtual void getPartial(void)     {return;}
    virtual void setWeights(void)     {return;}
    virtual void setVertex(int type)  {return;}
    virtual real findA(void)          {return out;}
    virtual real findB(void)          {return out;}
    virtual void putWeights(void)     {return;}
    virtual void displayWeights(void) {return;}
    virtual real GetGrad(void)        {return 0.;}
};



////////////////////////////////////////
//  Class of hidden neuron            //
////////////////////////////////////////

class Hidden : public Input
{
protected:
    bool   updated;
    bool   done;
    int    numIn;
    int    count;
    real   eta, alpha, gamma;   //learning rate, momentum, gain factor
    real   delta;               //delta is the partial of E w.r.t. the
                                //weight (without the input term from
                                //previous layer)
    real   threshold;
    real   netA, netB;          //lower and upper input for sigmoid function
    real   tLV, tUV;            //threshold for lower and upper vertices
    real   deltaThresh;
    real   patternDT;           //patternDT is used for Sample training
                                //(Epoch training)
    real   prevDT;
    real   *w, *partial;        //Weight vector and partial derivatives
    real   *lVertex, *uVertex;  //Lower and upper vertices
    real   *deltaWeight;        //Change in weights
    real   *patternDW;
    real   *prevPartial;
    Hidden **backLink;          //backLink is used to point to the
                                //incoming node, terminal at an Input
                                //node (with type cast)
    void addlink( void* );      //to be used in node-link operator
                                //+= defined below
public:
    Hidden( void );             //constructor
    ~Hidden( void );            //destructor
    void setParameter(real lRate, real momem, real gain)
        { eta = lRate; alpha = momem; gamma = gain; }
    real output( void );
    real Activation(real netIn);        //activation function
    real ActPrime(real Value);          //derivative of activation function
    void getDeltaW(void);               //accumulate partials
    void bp( real weight, real delta ); //Backpropagate
    void sampleUpdate(void);            //Weight updating after sample
    void insUpdate(void);               //updating after each instance
    void qpUpdate(void);                //using quickprop
    virtual real GetGrad( void );
    virtual void setWeights( void );    //weight input, include threshold
    virtual void setVertex(int type);   //weight on vertex, include threshold
    virtual void putWeights(void);      //weight output
    real findA(void);                   //find the range for sigmoid input
    real findB(void);                   //A is the lower and B is the upper
    real getA(void) {return netA;}      //get the range for sigmoid input
    real getB(void) {return netB;}      //A is the lower and B is the upper
    virtual void displayWeights(void);  //Show weights
    Hidden& operator+= ( Hidden& x )
        { addlink( &x ); return *this; }
    Hidden& operator+= ( Input& x )
        { addlink( &x ); return *this; }
};


////////////////////////////////////////
//  Class of output neuron            //
////////////////////////////////////////

class Output : public Hidden
{
public:
    void setWeights(void);          //weight input, include threshold
    void setVertex(int type);       //weight on vertex, include threshold
    void putWeights(void);          //weight output
    void displayWeights(void);
    void setInit(real initial)      //setInit() used for GONNA
        { out = initial; }
    void tbp( real target );
    void bp( real OutError );
    void findRange();               //find the range for sigmoid function
    real GetGrad( void );
};


////////////////////////////////////////
//  Class of input pattern            //
////////////////////////////////////////

class Pattern
{
public:
    bool getMem(int inSize, int outSize);
    real *in, *out;
};



////////////////////////////////////////
//  Class of (neural) network         //
////////////////////////////////////////

class Network: public Hidden
{
protected:
    Input*   inNode;
    Hidden*  hNode;
    Output*  outNode;
    real     learningRate;
    real     momemtum;
    real     gainFactor;
    int      nIn, nHidden, nOut;
    int      nPattern;
    int      inWidth, inDepth;
    int      outWidth, outDepth;
    int*     ranSequence;       //set random order of the instance in the
                                //training sample for insTraining method
    real*    patternError;
    real     totalError;
    real     stopError;
    real*    solution;
    real*    xPartial;
    int      weightsIn;         //Input or output weights, bool doesn't work
    int      weightsOut;
    int      weightsOn;         //If weightsOn is true, then showWeights()
    int      trainingMethod;    //sample or instance (random or sequential)
    char     inWFile[20];
    char     outWFile[20];
    unsigned nEpoch, iter;
    unsigned maxIteration;
    Pattern* pattern;
public:
    Network(void);              //default, with arbitrary connections
    Network(char* type);        //specific net, type = bp, Hopfield, etc.
    bool build(char* inFile);
    void forward(void);
    void backProp(void);
    bool trained(void);
    bool inputWeight(void)  {return weightsIn;}
    bool outputWeight(void) {return weightsOut;}
    bool onWeights(void)    {return weightsOn;}
    real GetError(void)     {return totalError;}
    real OutputError(real target, real outcome);
    real findGrad(void);
    void ComputeError(void);
    void SetWeights(real *Weights);             //Put Weights into the Net.
    void SetWeights(real *Weights, int type);   //Put Weights into the Net
                                                //with vertex type.
    void SetVertices(real *LV, real *UV);       //Put vertices into the Net.
    void FindSigRange(void);
    real LipConst(void);
    void randSample(int rFactor);   //randomize the sample sequence
    void readWeights(void);
    void writeWeights(void);
    void showWeights(void);
    void showSolution(void);        //showSolution() for GONNA
    void displayError(void);
};


// Definitions for class PartitionElement

//Class partition-element is the subdivision in the BB procedure that
//contains a hyperrectangle identified by its lower and upper vertices.
//Each partition element contains a lower and upper bound of solutions
//in the subregion, and a feasible solution associated with the upper bound.

class PartitionElement
{
    friend class PList;     //for PList to have access to the members of PE.
    friend class BB;        //for BB to have access to the members of PE.
protected:
    int  reference;         //Keep track of the reference to the list
    int  poolSize;          //Used for generating feasible solutions
    bool knownLV;           //Set knownLV to true if the lower vertex
                            //has been evaluated.
    bool knownUV;
    bool didBP;             //True if local search BP has been used
    real lipConstant;       //Lipschitz constant
    real lowerBound;
    real upperBound;
    real lowerValue;        //The functional values for the vertices
    real upperValue;
    real middleValue;
    real diaLength;
    real * lowerVertex;
    real * upperVertex;
    real * solution;        //solution is (pointer to) the current solution
    real * lowerSolution;   //lowerSolution is (pointer to) the lower bound,
    PartitionElement *next; //pointer to next PE
public:
    PartitionElement(void);     //Construct and initialize a partition element.
    PartitionElement(int Dimen);
    PartitionElement(int Dimen, real LipConstant);
                                //Constructor that uses adaptive LipConstant.
    ~PartitionElement(void);    //Destructor

    //Overload '=' for PE assignment
    PartitionElement & operator = (PartitionElement & NewPE);

    //Free the memory
    void Freemem(void);

    //Get the values
    real LipConstant(void)      {return lipConstant;}
    real LowerBound(void)       {return lowerBound;}
    real UpperBound(void)       {return upperBound;}
    real LowerValue(void)       {return lowerValue;}
    real UpperValue(void)       {return upperValue;}
    real DiaLength(void)        {return diaLength;}
    real * LowerVertex(void)    {return lowerVertex;}
    real * UpperVertex(void)    {return upperVertex;}
    real * Solution(void)       {return solution;}
    real * LowerSolution(void)  {return lowerSolution;}
    real CompuLip(real * lowerV, real * upperV);

    //Set the values
    void SetLipConstant(real lipc)  {lipConstant = lipc;}
    void SetLowerBound(real LB)     {lowerBound = LB;}
    void SetUpperBound(real UB)     {upperBound = UB;}
    void SetLowerValue(real LV)     {lowerValue = LV; knownLV = true;}
    void SetUpperValue(real UV)     {upperValue = UV; knownUV = true;}
    void SetLowerVertex(real *LV);
    void SetUpperVertex(real *UV);
    void SetSolution(real * Solution);
    void SetLowerSolution(real *LS);
    void SetDiaLength(void);
    void CompuLip(void);

    void FindLower(void);       //Find a lower bound to solutions in
                                //the partition element.
    void FindUpper(real *Parent, int Method);
                                //Find a feasible solution in the partition
                                //element that serves as an upper bound.
    real Evaluate(real *Weights);
                                //Get the fValue of the solution,
    real RanSearch(real lRange, real uRange);
                                //Get a random value within the ranges,
    int  FindMin(real * Vector);
                                //Find the position of the min element in Vector.
};



// Definitions for class PList

//Class partition element (PE) list. PList maintains an active list of
//partitioning elements during the BB procedure. The list is sorted
//according to lower (or upper) bound associated with the element.
//New PE can be added, and those with lower bound greater than upper
//bound can be deleted.
//May add a sublist that contains the uncertain PE in the BB procedure.

class PList
{
    friend class BB;
protected:
    PartitionElement *list;             //Points to a list of PE starting with the
                                        //list head.
    PartitionElement *LastPE(void);     //Get the pointer to the last PE.
    PartitionElement *NewListPE(void);  //Create a new PE.
public:
    PList(void);
    PList(int numPE);           //Construct a PList with numPE PEs.
                                //Default is one PE if no argu is given
    ~PList(void);               //Dereference the list.

    //Overload '=' for list assignment
    PList & operator = (PList & NewList);

    void Reference(PartitionElement *NewPE);
                                //Set reference count.
    void Dereference(PartitionElement *Pointor);
                                //Check reference count and delete PEs
                                //that still exist. This prevents deleting
                                //non-existent list elements.
    PartitionElement * NewPE(void);     //Create a new PE with the 'new' operator
    PartitionElement * Get(void);       //Get the first PE of the list.
    PartitionElement * Get(int Index);  //Get the Index'th PE of the list,
    void Add(PartitionElement *NewPE);  //Insert a PE at the beginning of the list,
    void Insert(PartitionElement *NewPE, int Index);
                                //Insert the NewPE before the PE
                                //indicated by Index.
    void Append(PartitionElement *NewPE);
                                //Insert a PE at the end of the list,
    real GetUpperBound(int Index);      //Get the upper bounds of all PE
    real GetLowerBound(int Index);
    void RemovePE(real Value);  //Remove the PEs with lower bounds larger
                                //than the upper bound,
    void RemoveFirst(void);     //Remove a PE from the list,
    void RemoveLast(void);
    void Merge(PList & NewList);        //Merge the current list with a new list
    void Sort(char* Key);       //Sort the list by lower or upper bound
    void SortU(void);
    void SortL(void);
    int  FindUpIndex(real Value);       //Find the place the NewPE to be inserted.
                                //The list is sorted by lower or upper bound,
    int  FindLoIndex(real Value);
    int  ListLength(void);      //Get the length of the list
};



//================================================
// Definitions for class BB (branch and bound)  ||

//Class BB (branch and bound) implements the BB algorithm.

class BB
{
    friend class Interface;    //For run-time control
    friend class PartitionElement;
protected:
    long iteration;
    long maxIteration;        //A stopping criterion
    int  listSize;            //Determines the size of the PList that is
                              //kept in memory. A longer list is truncated
                              //and the truncated part is saved to a file.
    int  arraySize;           //The size of the PE batch used when getting memory.
    int  PEcount;
    PList listBuffer;
    real currentLB;           //Bounds from the current PE
    real currentUB;
    real gLowerBound;         //Global bounds and error tolerance
    real gUpperBound;         //that serve as one of the stopping
    real * gSolution;         //Keep the global solution
    real errorTolerance;      //criteria.
    real weightMin;           //weightMin and weightMax determine the initial
    real weightMax;           //weight range.
    real * listLowerBounds;   //List bounds are vectors that keep the best values
    real * listUpperBounds;   //of the lists saved in files.
    PartitionElement *PEarray;   //Used in memory assignment.
public:
    BB(void);     //Constructor, initializes parameters and creates
                  //a PList with an initial PE. This is done
                  //implicitly by invoking the PList constructor.
    ~BB(void);

    PartitionElement *Initialize(void);

    PartitionElement *Branching(void);
                  //Select a PE for further partition. There may
                  //be different branching strategies such as
                  //depth first, breadth first, and best first.

    void Bounding(PartitionElement * PE);
                  //Find the upper and lower bound of the selected
                  //PE. Then partition the PE further.

    void LoadNet(PartitionElement * PE);
                  //Load the currently best solution to the net.

    void ShowBound(PartitionElement *PE);

    bool Converged(void);

    int MaxEdgeIndex(real *temp1, real *temp2);
                  //Find the index of the longest edge of the PE.

    PartitionElement * GetPE(void);
                  //Get a PE from PEarray in Partition().

    //Save to disk in case of insufficient memory
    void SaveList(outstream outFile);

    //Get the saved list
    void LoadList(instream inFile);

    //Update the current active list with the
    //newly created PEs
    void UpdateList(PartitionElement *NewPE);

    //Put the newly created PE in the listBuffer
    //according to the given search method.
    void MergeList(PList *NewList);

    //Get the best value from those lists saved in file. Then load the list
    //and recreate the current listBuffer.
    void FindBestInFile(void);

    //Partition() divides the current PE into two or more sub-elements,
    //returns a linked list of them, and merges the new list with the
    //existing one.
    void Partition(PartitionElement * CurrentPE);
};

#endif


//*******************************************************************//
// bb.cc                                                             //
// Class implementations for the Branch and Bound based Piyavskii    //
// Algorithms                                                        //
//                                                                   //
// 920126.zt                                                         //
//                                                                   //
// Use Piyavskii lower bounding.                            920129   //
// Add BB search options (best first, depth first ...)      920329   //
// Add CompuLip(), local lipConstant in PartitionElement    920403   //
// Add local BackProp.                                      920529   //
//*******************************************************************//
#define EXT

#include "gota.h"

Network net;               //define a neural net instance, net
int numWeights = 20;       //the size of the weight vector
int bpCount = 0;           //count for the number of bp() being called
int showVertex;            //1 for true, show the vertices of the hyper-rectangle
real lBound, uBound;       //the bounds for the initial partition element
real startBpThresh = .80;  //threshold to start BP when error is
                           //less than that
real lipThresh = 0.1;      //Delete the PE if its Lipschitz constant
                           //falls below the threshold
real slopeThresh = .001;   //Stop local gradient search if the norm
                           //of the gradient is less than that
real improveThresh = .000001;   //Stop local gradient search if the current
                                //error differs from the mean error of the
                                //last three iterations by less than the threshold
int searchMethod;          //see header file, enum search
int findUpperMethod;       //the way to find the local upper bound
int unitType = 0;          //sigmoid or asymmetric sigmoid or Gaussian
int errorFunction = 0;     //type of error function
int fanInSplit = 0;        //If true (1), eta is divided by # of inputs
int bufSize = 0;           //The size of the weight array
int wcount = 0;            //weight count for control weight I/O
int singleStep;            //1 for true, stop at each iteration
real weightLowerRange = -1.;    //weight range (lower, upper)
real weightUpperRange = 1.;
real sigmoidPrimeOffset = .1;   //Add to sigmoid-prime to keep it from being 0
real errorThreshold = .1;       //Error set to zero if less than threshold
real * weightBuf;               //Global variables for control weight I/O

//The following is mostly for using quickprop, adapted from quickprop1.

real modeSwitchThreshold = .0;  //Inside threshold, do normal grad descent
                                //otherwise, jump.
real MaxFactor = 2.25;          //Jump at most this times last step
real weightDecay = -.0001;      //Weight decay


void Randomize(void)
{
    //use current time as a seed to initialize the
    //rand number generator
    time_t timeSeed;
    time(&timeSeed);
    srand(timeSeed);
}


// definitions for class PartitionElement
//=======================================

PartitionElement::PartitionElement(void)
{
    lowerBound = -100.;
    upperBound = 100.;
    poolSize = 4;
    lipConstant = 1.5;
    knownLV = false;
    knownUV = false;
    didBP = false;

    lowerVertex = new real [numWeights];
    if (!lowerVertex) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    upperVertex = new real [numWeights];
    if (!upperVertex) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    //solution is (pointer to) the current solution
    solution = new real [numWeights];
    if (!solution) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    //lowerSolution is (pointer to) the lower bound.
    lowerSolution = new real [numWeights];
    if (!lowerSolution) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    next = NULL;  //pointer to next PE
}


PartitionElement::PartitionElement(int NumWeights)
{
    lipConstant = 10000.0;       //A gross over-estimate
    lowerBound = 0.;
    upperBound = 100.;
    poolSize = 4;

    lowerVertex = new real [NumWeights];
    if (!lowerVertex) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    upperVertex = new real [NumWeights];
    if (!upperVertex) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    //solution is (pointer to) the current solution
    solution = new real [NumWeights];
    if (!solution) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    //lowerSolution is (pointer to) the lower bound.
    lowerSolution = new real [NumWeights];
    if (!lowerSolution) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    next = NULL;  //pointer to next PE
}


PartitionElement::PartitionElement(int numWeights, real LipConstant)
{
    lipConstant = LipConstant;    //Initialize lipConstant
    lowerBound = 0.;
    upperBound = 100.;
    diaLength = 100.;
    poolSize = 2;

    lowerVertex = new real [numWeights];
    if (!lowerVertex) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    upperVertex = new real [numWeights];
    if (!upperVertex) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    //solution is (pointer to) the current solution
    solution = new real [numWeights];
    if (!solution) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }

    //lowerSolution is (pointer to) the lower bound.
    lowerSolution = new real [numWeights];
    if (!lowerSolution) {
        cout << "Memory allocation error in PE initialization. \n";
        exit(1);
    }
    next = NULL;
}


PartitionElement::~PartitionElement(void)
{
    //The vertices and solutions are allocated with new [], so they
    //are released with delete [].
    if (lowerVertex)
        delete [] lowerVertex;
    if (upperVertex)
        delete [] upperVertex;
    if (lowerSolution)
        delete [] lowerSolution;
    if (solution)
        delete [] solution;
}


//PE assignment through overloading '='

PartitionElement& PartitionElement::operator = (PartitionElement& PE)
{
    next = PE.next;
    poolSize = PE.poolSize;
    lipConstant = PE.lipConstant;
    lowerBound = PE.lowerBound;
    upperBound = PE.upperBound;

    //Should copy the values in case the old object should be deleted.
    lowerVertex = PE.lowerVertex;
    upperVertex = PE.upperVertex;
    solution = PE.solution;
    lowerSolution = PE.lowerSolution;
    return *this;
}
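
The assignment operator above copies the four array pointers rather than the values they point to, so two elements end up sharing the same storage; this is the concern raised in its comment. A minimal sketch of a value-copying assignment follows. It assumes a simplified stand-in struct `Box` (hypothetical, not the `PartitionElement` class itself) that owns a single array, and it assumes `real` is a floating-point typedef.

```cpp
#include <cassert>
#include <cstddef>

typedef double real;   // assumption: 'real' is a floating-point typedef

// Hypothetical stand-in for a partition element that owns one array.
struct Box {
    std::size_t n;     // number of weights
    real *vertex;      // owned array of length n

    explicit Box(std::size_t size) : n(size), vertex(new real[size]) {
        for (std::size_t i = 0; i < n; i++)
            vertex[i] = 0.0;
    }

    Box(const Box &other) : n(other.n), vertex(new real[other.n]) {
        for (std::size_t i = 0; i < n; i++)
            vertex[i] = other.vertex[i];
    }

    ~Box() { delete [] vertex; }

    // Copy the values, not the pointer, so that deleting 'other'
    // later leaves this object's data intact.
    Box &operator=(const Box &other) {
        if (this != &other) {
            assert(n == other.n);   // the sketch only handles equal sizes
            for (std::size_t i = 0; i < n; i++)
                vertex[i] = other.vertex[i];
        }
        return *this;
    }
};
```

With this form, changing or destroying the source after `b = a` does not affect `b`, at the cost of one array copy per assignment.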


void PartitionElement::SetLowerVertex(real * LV)
{
    for (int i = 0; i < numWeights; i++)
        lowerVertex[i] = LV[i];
}


void PartitionElement::SetUpperVertex(real * UV)
{
    for (int i = 0; i < numWeights; i++)
        upperVertex[i] = UV[i];
}


void PartitionElement::SetSolution(real * Solution)
{
    for (int i = 0; i < numWeights; i++)
        solution[i] = Solution[i];
}

void PartitionElement::SetLowerSolution(real * LS)
{
    for (int i = 0; i < numWeights; i++)
        lowerSolution[i] = LS[i];
}

void PartitionElement::SetDiaLength(void)
{
    real temp = 0.;
    for (int i = 0; i < numWeights; i++)
        temp += (upperVertex[i] - lowerVertex[i]) *
                (upperVertex[i] - lowerVertex[i]);

    diaLength = sqrt(temp);
}

void PartitionElement::FindLower(void)
{
    //lowerBound is passed from the parent PE when it is partitioned
    real tmax = middleValue;

    if (tmax < lowerValue)
        tmax = lowerValue;
    if (tmax < upperValue)
        tmax = upperValue;

    real temp = tmax - lipConstant * diaLength;

    //if the new Lipschitz bound is greater than the one from the
    //parent PE, then the current PE uses the new lowerBound, otherwise
    //the better bound is kept
    if (temp > lowerBound)
        lowerBound = temp;
}
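
The bound computed in FindLower() rests on the Lipschitz property: if |F(x) - F(w)| <= L * ||x - w|| and the element has diagonal length d, then F(x) >= F(w) - L * d for every x in the element and every sampled point w, so the largest sampled value gives the tightest bound of this form. The following is a minimal free-function sketch of that computation; the function name and the parameter list are assumptions made for illustration and `real` is assumed to be a floating-point typedef.

```cpp
#include <cassert>
#include <algorithm>

typedef double real;   // assumption: 'real' is a floating-point typedef

// Lower bound on F over a box with diagonal length diaLength, given a
// Lipschitz constant lip and the function values at three sampled points.
// The bound inherited from the parent is kept if it is tighter.
real lipschitzLowerBound(real fLower, real fMiddle, real fUpper,
                         real lip, real diaLength, real parentBound)
{
    assert(lip >= 0.0 && diaLength >= 0.0);
    real tmax = std::max(fMiddle, std::max(fLower, fUpper));
    real bound = tmax - lip * diaLength;
    return std::max(bound, parentBound);
}
```

For example, with sampled values 0.5, 0.75 and 0.625, a Lipschitz constant of 2 and a diagonal of 0.25, the new bound is 0.75 - 0.5 = 0.25, which replaces a parent bound of -100 but not a parent bound of 0.5.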

void PartitionElement::FindUpper(real * parent, int method)
{
    real tmin;    //For holding the min of F(a), F(b), and F(w')

    switch (method) {
    case RANDOM: {
        real *fValue = new real [poolSize];    //For holding function values
        if (!fValue) {
            cout << "Memory assignment error in FindUpper().\n";
            exit(1);
        }

        real **temp = new real* [poolSize];
        for (int i = 0; i < poolSize; i++) {
            temp[i] = new real [numWeights];
            if (!temp[i]) {
                cout << "Memory assignment error in FindUpper().\n";
                exit(1);
            }
            for (int j = 0; j < numWeights; j++) {
                temp[i][j] = RanSearch(lowerVertex[j], upperVertex[j]);
            }
            fValue[i] = Evaluate(temp[i]);       //Get the fValue of the solution
        }
        int index = FindMin(fValue);  //Find which is the best solution.
        upperBound = fValue[index];
        solution = temp[index];       //Get the best out of poolSize
                                      //solutions.
        lowerValue = fValue[0];
        upperValue = fValue[1];
        break;
    }
    case VERTICES:
        lowerValue = Evaluate(lowerVertex);
        upperValue = Evaluate(upperVertex);
        middleValue = Evaluate(parent);

        tmin = middleValue;         //Keep the middleValue for use in FindLower()
        if (tmin > lowerValue)
            tmin = lowerValue;
        if (tmin > upperValue)
            tmin = upperValue;

        upperBound = tmin;
        break;
    case BACKPROP: {
        lowerValue = Evaluate(lowerVertex);
        upperValue = Evaluate(upperVertex);

        real * init = new real [numWeights];
        if (!init) {
            cout << "Memory assignment error in FindUpper().\n";
            exit(1);
        }
        for (int j = 0; j < numWeights; j++) {
            init[j] = RanSearch(lowerVertex[j], upperVertex[j]);
        }
        tmin = Evaluate(init);

        real err1 = 100.;
        real err2 = 100.;
        if (tmin < startBpThresh) {
            net.SetWeights(init);
            didBP = true;
            while (tmin > 0.04) {
                net.backProp();
                bpCount++;

                // The following are two heuristics to stop local search.
                // Note that although both may be activated, setting a stringent
                // threshold for one method can effectively turn it off.

                if ((tmin = net.findGrad()) < slopeThresh)  //checking the norm of the
                    break;                                  //gradient at the current
                                                            //solution
                err2 = err1;
                err1 = tmin;
                tmin = net.GetError();
                err1 = (tmin + err1 + err2)/3.0;            //get the mean error in the
                                                            //last three iterations
                if (fabs(err1 - tmin) < improveThresh)      //get out of local BP when
                    break;                                  //no apparent improvement
            }
        }
        upperBound = tmin;
        break;
    }
    }
}

void PartitionElement::CompuLip(void)
{
    net.SetVertices(lowerVertex, upperVertex);
    lipConstant = net.LipConst();
}

real PartitionElement::Evaluate(real * Solution)
{
    net.SetWeights(Solution);
    net.forward();
    real temp = net.GetError();
    return temp;
}


real PartitionElement::RanSearch(real lRange, real uRange)
{
    real temp;
    return temp = lRange + (uRange - lRange)*(real)(rand() % 10000)/10000.0;
}


int PartitionElement::FindMin(real * Vector)
{
    int k = 0;
    real temp = Vector[0];
    for (int i = 1; i < poolSize; i++) {
        if (Vector[i] < temp) {
            temp = Vector[i];
            k = i;
        }
    }
    return k;
}
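
RanSearch() and FindMin() together support the RANDOM upper-bounding method: a pool of points is drawn inside the element, each is evaluated, and the smallest value is an upper bound on the minimum over the element. The sketch below shows the idea outside the class. The objective `sphere` is a hypothetical stand-in for the network error, the free functions and their signatures are assumptions, and `real` is assumed to be a floating-point typedef.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

typedef double real;   // assumption: 'real' is a floating-point typedef

// Sample in [lRange, uRange) on a grid of 10000 steps, as RanSearch() does.
real ranSearch(real lRange, real uRange)
{
    return lRange + (uRange - lRange) * (real)(rand() % 10000) / 10000.0;
}

// Index of the smallest entry of a vector of length n.
int findMin(const real *v, int n)
{
    assert(n > 0);
    int k = 0;
    for (int i = 1; i < n; i++)
        if (v[i] < v[k])
            k = i;
    return k;
}

// Hypothetical objective standing in for the network error.
real sphere(const real *w, int n)
{
    real sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += w[i] * w[i];
    return sum;
}

// Best value among poolSize random points inside the box [lo, hi].
real randomUpperBound(const real *lo, const real *hi, int n, int poolSize)
{
    assert(n > 0 && poolSize > 0);
    std::vector<real> w(n);
    std::vector<real> fValue(poolSize);
    for (int i = 0; i < poolSize; i++) {
        for (int j = 0; j < n; j++)
            w[j] = ranSearch(lo[j], hi[j]);
        fValue[i] = sphere(&w[0], n);
    }
    return fValue[findMin(&fValue[0], poolSize)];
}
```

A larger pool tends to give a tighter upper bound at the cost of more function evaluations per element.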


// definitions for class PList ||

PList::PList()
{
    list = new PartitionElement [1];
}

/*
PList::PList(int numPE = 1)
{
    PartitionElement dummy;
    PartitionElement *tempPE = new PartitionElement [numPE];
    tempPE[0] = dummy;
    for (int i = 1; i < numPE; i++) {
        tempPE[i] = dummy;  //Invoke the constructor for each PE
    }
    list = tempPE;
}
*/

/*
PList::PList(int numPE = 100)
{
    PartitionElement dummy;
    PartitionElement *tempPE = new PartitionElement [numPE];
    tempPE[0] = dummy;
    for (int i = 1; i < numPE; i++) {
        tempPE[i] = dummy;  //Invoke the constructor for each PE
        tempPE[i-1].next = &tempPE[i];   //Link the PEs together.
    }
    list = tempPE;
}
*/


PList::~PList()
{
    if (list)
        delete list;
}

inline void PList::Reference(PartitionElement * PE)
{
    if (PE->reference >= 0)
        ++PE->reference;
}


void PList::Dereference(PartitionElement * PE)
{
    while (PE->reference > 0 && --PE->reference == 0) {
        PartitionElement *tempPE = PE->next;
        delete (PE);
        PE = tempPE;
    }
}


//NewPE() simply creates a PE using 'new', then delete can be
//used to free the memory allocated in Dereference()

inline PartitionElement* PList::NewPE(void)
{
    PartitionElement* PE = new PartitionElement;
    PE->reference = 1;
    return PE;
}


PartitionElement* PList::Get(void)
{
    //Get the first PE of the list. It is often referred to as pop()
    if (!list) {
        cout << "PE list is empty, BB fails! \n";
        exit(1);
    }

    PartitionElement *head = list;
    // Reference(head);
    // Dereference(list);

    list = head->next;
    return head;
}


void PList::Add(PartitionElement * NewPE)
{
    //Add to the front of the list
    //PartitionElement * tempP = NewPE;

    NewPE->next = list;
    list = NewPE;
}

void PList::Insert(PartitionElement * NewPE, int Index)
{
    //Insert to the list, at the position before the PE indicated by Index.
    //The label of PEs in PList starts with 0.
    PartitionElement * previous, *tempP = list;

    if (Index == 0)
        Add(NewPE);
    else {
        for (int i = 0; i < Index; i++) {
            previous = tempP;
            tempP = previous->next;
        }
        previous->next = NewPE;
        (previous->next)->next = tempP;
    }
}


PartitionElement * PList::LastPE(void)
{
    //get the pointer to the last PE in the list
    PartitionElement *tempP, * previous = list;

    for (tempP = previous; tempP; previous = tempP, tempP = previous->next)
        ;   //Null statement, stop when tempP = NULL
    return previous;
}


void PList::Append(PartitionElement * NewPE)
{
    //Add to the end of the list
    //PartitionElement *tempP = NewPE;

    if (list == NULL)
        list = NewPE;
    else
        (LastPE())->next = NewPE;
}


real PList::GetUpperBound(int Index = 0)
{
    //Get the upper bound of PE indicated by Index, default Index is 0,
    //that is, get the value from the PE on the top of the list.
    PartitionElement *tempP, * previous = list;

    if (Index == 0)
        return list->UpperBound();
    else {
        for (int i = 0; i < Index; i++) {
            tempP = previous->next;
            previous = tempP;
        }
        return tempP->UpperBound();
    }
}


real PList::GetLowerBound(int Index = 0)
{
    //Get the lower bound of PE indicated by Index, default Index is 0,
    //that is, get the value from the PE on the top of the list.
    PartitionElement *tempP, * previous = list;

    if (Index == 0)
        return list->LowerBound();
    else {
        for (int i = 0; i < Index; i++) {
            tempP = previous->next;
            previous = tempP;
        }
        return tempP->LowerBound();
    }
}


void PList::RemoveFirst(void)
{
    //Remove the first PE from the list. This is used to pick a PE
    //for further partitioning, which creates new PEs, but the old
    //one is no longer needed.
    PartitionElement *tempP = list;

    tempP = tempP->next;
    list = tempP;
}


void PList::RemoveLast(void)
{
    //Remove the last PE from the list. This is useful if the length of
    //the PE list needs to be controlled. RemoveLast() can be called repeatedly
    //to delete as many PEs as desired.
    if (list == NULL)
        return;
    if (list->next == NULL) {     //Only one PE in the list
        list = NULL;
        return;
    }

    PartitionElement *tempP, * previous = list;

    for (tempP = list->next; tempP->next; previous = tempP, tempP = tempP->next)
        ;   //Null statement, stop when tempP is the last PE
    previous->next = NULL;
    //Set the pointer to the last PE to NULL
}


void PList::RemovePE(real Value)
{
    //Remove the PEs with lower bounds larger than the given value
    PartitionElement *tempP, * previous;

    //Drop leading PEs whose lower bound is larger than the given value
    while (list && (list->LowerBound() >= Value))
        list = list->next;

    if (list == NULL)
        return;

    previous = list;
    tempP = list->next;

    //Iterative search through the list
    while (tempP) {    //There are more PEs
        if ((tempP->LowerBound()) >= Value) {
            previous->next = tempP->next;
            tempP = previous->next;
        }
        else {
            previous = tempP;
            tempP = previous->next;
        }
    }   //end while (tempP)
}


void PList::Merge(PList & NewList)
{
    //Merge the current list with a new list. Appending the head of
    //NewList links its whole chain to the end of the current list, so
    //a single Append() is enough; appending each PE in turn would link
    //the tail back into the chain.
    Append(NewList.list);
}


void PList::Sort(char* Key)
{
    //Sort the list by either lower or upper bound according to the sort Key
    if (Key[0] == 'L' || Key[0] == 'l')
        SortL();
    else if (Key[0] == 'U' || Key[0] == 'u')
        SortU();
    else
        cout << "Invalid sort key \n";
}

void PList::SortL(void)
{
    //Sort the list by lower bound.
    //This sort routine is adapted from intList::sort() in the g++ library.
    //Strategy: place runs in queue, merge runs until done
    //This is often very fast

    //Do nothing if list is empty or has only one PE
    if (list == NULL || list->next == NULL)
        return;

    PartitionElement *head = list;
    PartitionElement *tempP1 = head;
    PartitionElement *tempP2 = tempP1->next;

    int qLength = 250;   // Guess a good queue size, realloc if necessary
    int qIn = 0;         // Count of PE in the queue.

    PartitionElement **queue = new PartitionElement * [qLength];

    while (tempP2 != NULL) {
        if (tempP1->LowerBound() > tempP2->LowerBound()) {
            if (head == tempP1) {       // minor optimization: ensure runlen >= 2
                head = tempP2;
                tempP1->next = tempP2->next;
                tempP2->next = tempP1;
                tempP2 = tempP1->next;
            }
            else {
                if (qIn + 1 >= qLength) {   // grow the queue, keeping the queued
                                            // runs and room for the final run
                    PartitionElement **bigger = new PartitionElement * [2 * qLength];
                    for (int k = 0; k < qIn; k++)
                        bigger[k] = queue[k];
                    delete [] queue;
                    queue = bigger;
                    qLength *= 2;
                }
                queue[qIn++] = head;
                tempP1->next = NULL;
                head = tempP1 = tempP2;
                tempP2 = tempP2->next;
            }
        }
        else {
            tempP1 = tempP2;
            tempP2 = tempP2->next;
        }
    }

    int count = qIn;
    queue[qIn] = head;
    if (++qIn >= qLength)
        qIn = 0;
    int qOut = 0;

    while (count-- > 0) {
        tempP1 = queue[qOut];
        if (++qOut >= qLength)
            qOut = 0;
        tempP2 = queue[qOut];
        if (++qOut >= qLength)
            qOut = 0;

        if (tempP1->LowerBound() <= tempP2->LowerBound()) {
            head = tempP1;
            tempP1 = tempP1->next;
        }
        else {
            head = tempP2;
            tempP2 = tempP2->next;
        }

        queue[qIn] = head;
        if (++qIn >= qLength)
            qIn = 0;

        for (;;) {
            if (tempP1 == NULL) {
                head->next = tempP2;
                break;
            }
            else if (tempP2 == NULL) {
                head->next = tempP1;
                break;
            }
            else if (tempP1->LowerBound() <= tempP2->LowerBound()) {
                head->next = tempP1;
                head = tempP1;
                tempP1 = tempP1->next;
            }
            else {
                head->next = tempP2;
                head = tempP2;
                tempP2 = tempP2->next;
            }
        }
    }

    list = queue[qOut];
    delete [] queue;
}

void PList::SortU(void)
{
    //Sort the list by upper bound, using the same strategy as SortL().

    //Do nothing if list is empty or has only one PE
    if (list == NULL || list->next == NULL)
        return;

    PartitionElement *head = list;
    PartitionElement *tempP1 = head;
    PartitionElement *tempP2 = tempP1->next;

    int qLength = 250;   // Guess a good queue size, realloc if necessary
    int qIn = 0;         // Count of PE in the queue.

    PartitionElement **queue = new PartitionElement * [qLength];

    while (tempP2 != NULL) {
        if (tempP1->UpperBound() > tempP2->UpperBound()) {
            if (head == tempP1) {       // minor optimization: ensure runlen >= 2
                head = tempP2;
                tempP1->next = tempP2->next;
                tempP2->next = tempP1;
                tempP2 = tempP1->next;
            }
            else {
                if (qIn + 1 >= qLength) {   // grow the queue, keeping the queued
                                            // runs and room for the final run
                    PartitionElement **bigger = new PartitionElement * [2 * qLength];
                    for (int k = 0; k < qIn; k++)
                        bigger[k] = queue[k];
                    delete [] queue;
                    queue = bigger;
                    qLength *= 2;
                }
                queue[qIn++] = head;
                tempP1->next = NULL;
                head = tempP1 = tempP2;
                tempP2 = tempP2->next;
            }
        }
        else {
            tempP1 = tempP2;
            tempP2 = tempP2->next;
        }
    }

    int count = qIn;
    queue[qIn] = head;
    if (++qIn >= qLength)
        qIn = 0;
    int qOut = 0;

    while (count-- > 0) {
        tempP1 = queue[qOut];
        if (++qOut >= qLength)
            qOut = 0;
        tempP2 = queue[qOut];
        if (++qOut >= qLength)
            qOut = 0;

        if (tempP1->UpperBound() <= tempP2->UpperBound()) {
            head = tempP1;
            tempP1 = tempP1->next;
        }
        else {
            head = tempP2;
            tempP2 = tempP2->next;
        }

        queue[qIn] = head;
        if (++qIn >= qLength)
            qIn = 0;

        for (;;) {
            if (tempP1 == NULL) {
                head->next = tempP2;
                break;
            }
            else if (tempP2 == NULL) {
                head->next = tempP1;
                break;
            }
            else if (tempP1->UpperBound() <= tempP2->UpperBound()) {
                head->next = tempP1;
                head = tempP1;
                tempP1 = tempP1->next;
            }
            else {
                head->next = tempP2;
                head = tempP2;
                tempP2 = tempP2->next;
            }
        }
    }

    list = queue[qOut];
    delete [] queue;
}
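
The core step of both sort routines is the merging of two already sorted runs into one. The following sketch isolates that step. It assumes a minimal node type `Node` (hypothetical) that carries only the sort key and the link, and it assumes `real` is a floating-point typedef; it is not the list class used in this file.

```cpp
#include <cassert>
#include <cstddef>

typedef double real;   // assumption: 'real' is a floating-point typedef

// Hypothetical minimal node: only the sort key and the link.
struct Node {
    real lowerBound;
    Node *next;
};

// True if the NULL-terminated list is in ascending key order.
bool isAscending(const Node *p)
{
    for (; p != NULL && p->next != NULL; p = p->next)
        if (p->lowerBound > p->next->lowerBound)
            return false;
    return true;
}

// Merge two ascending, NULL-terminated runs; returns the merged head.
Node *mergeRuns(Node *a, Node *b)
{
    if (a == NULL) return b;
    if (b == NULL) return a;

    Node *head;
    if (a->lowerBound <= b->lowerBound) { head = a; a = a->next; }
    else                                { head = b; b = b->next; }

    Node *tail = head;
    while (a != NULL && b != NULL) {
        if (a->lowerBound <= b->lowerBound) {
            tail->next = a; tail = a; a = a->next;
        }
        else {
            tail->next = b; tail = b; b = b->next;
        }
    }
    tail->next = (a != NULL) ? a : b;   // attach whatever run remains

    assert(isAscending(head));          // debug check of the postcondition
    return head;
}
```

Because ties are resolved in favour of the first run, the merge is stable, which keeps elements with equal bounds in their original order.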


int PList::FindUpIndex(real Value)
{
    //Find the place the NewPE to be inserted.
    //The list is sorted by upper bound.
    int count = 0;
    PartitionElement *tempP = list, *previous;

    while (tempP) {
        if (tempP->UpperBound() < Value)
            return count;
        previous = tempP;
        tempP = previous->next;
        count++;
    }
    return count;
}

int PList::FindLoIndex(real Value)
{
    //Find the place the NewPE to be inserted.
    //The list is sorted by lower bound.
    int count = 0;
    PartitionElement *tempP = list, *previous;

    while (tempP) {
        if (tempP->LowerBound() > Value)
            return count;
        previous = tempP;
        tempP = previous->next;
        count++;
    }
    return count;
}


int PList::ListLength(void)
{
    //Get the length of the list
    int count = 0;
    PartitionElement *tempP = list;

    for (; tempP; tempP = tempP->next, ++count)
        ;  //Null statement
    return count;
}
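
FindLoIndex() and Insert() are meant to be used as a pair: the first locates the position that keeps the list ordered by lower bound, the second links the new element in at that position. The sketch below shows the pairing on a minimal node type. `Node`, `findLoIndex` and `insertAt` are hypothetical free-standing versions written for illustration, and `real` is assumed to be a floating-point typedef.

```cpp
#include <cassert>
#include <cstddef>

typedef double real;   // assumption: 'real' is a floating-point typedef

// Hypothetical minimal node: only the sort key and the link.
struct Node {
    real lowerBound;
    Node *next;
};

// Position of the first node whose key exceeds value (list ascending),
// or the list length if there is no such node.
int findLoIndex(const Node *list, real value)
{
    int count = 0;
    for (const Node *p = list; p != NULL; p = p->next, ++count)
        if (p->lowerBound > value)
            return count;
    return count;
}

// Insert newNode before the node at position index (0-based);
// returns the head of the resulting list.
Node *insertAt(Node *list, Node *newNode, int index)
{
    assert(index >= 0);
    if (index == 0 || list == NULL) {
        newNode->next = list;
        return newNode;
    }
    Node *previous = list;
    for (int i = 1; i < index && previous->next != NULL; i++)
        previous = previous->next;
    newNode->next = previous->next;
    previous->next = newNode;
    return list;
}
```

Inserting each new element this way costs one pass over the list but avoids re-sorting the whole list after every partition.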


//==============================================
// definitions for class BB (branch and bound)

BB::BB()
{
    iteration = 0;
    maxIteration = 10000;
    listSize = 100;
    arraySize = 50;
    PEcount = 0;
    gLowerBound = -100.0;
    gUpperBound = 100.0;
    currentLB = -100.0;
    currentUB = 100.0;
    errorTolerance = 0.1;
    weightMin = 0.0;
    weightMax = 1.0;

    PEarray = new PartitionElement [arraySize];
    if (!PEarray) {
        cout << "Error in memory assignment in BB \n";
        exit(1);
    }

    gSolution = new real [numWeights];
    if (!gSolution) {
        cout << "Error in memory assignment in BB \n";
        exit(1);
    }

    //Not used yet, the listBuffer is unlimited now
    listLowerBounds = new real [listSize];
    if (!listLowerBounds) {
        cout << "Error in memory assignment in BB \n";
        exit(1);
    }

    listUpperBounds = new real [listSize];
    if (!listUpperBounds) {
        cout << "Error in memory assignment in BB \n";
        exit(1);
    }
}

BB::~BB()
{
    //The arrays are allocated with new [], so they are released
    //with delete [].
    if (listLowerBounds)
        delete [] listLowerBounds;
    if (listUpperBounds)
        delete [] listUpperBounds;
    if (PEarray)
        delete [] PEarray;
}


PartitionElement * BB::Initialize()
{
    real LV = lBound;     //Later to be made changeable
    real UP = uBound;

    numWeights = bufSize;

    for (int i = 0; i < numWeights; i++) {
        listBuffer.list->lowerVertex[i] = LV;
        listBuffer.list->upperVertex[i] = UP;
        listBuffer.list->solution[i] = (UP + LV)/2;
    }

    listBuffer.list->SetDiaLength();
    listBuffer.list->FindUpper(listBuffer.list->Solution(), findUpperMethod);
    listBuffer.list->FindLower();

    gLowerBound = listBuffer.list->LowerBound();
    currentLB = gLowerBound;
    gUpperBound = listBuffer.list->UpperBound();
    currentUB = gUpperBound;

    net.SetWeights(listBuffer.list->Solution());
    net.forward();
    return listBuffer.Get();
}


PartitionElement *BB::Branching()
{
    /*
    PartitionElement *temp;
    temp = listBuffer.Get();
    return temp;
    */

    return listBuffer.Get();
}

bool BB::Converged(void)
{
    if ((gUpperBound <= errorTolerance) || (iteration >= maxIteration))
        return true;
    return false;
}

void BB::ShowBound(PartitionElement *PE)
{
    real *t1, *t2, *t3;

    t1 = PE->LowerVertex();
    t2 = PE->UpperVertex();
    // t3 = PE->Solution();

    iteration++;

    cout << "E bound = " << gUpperBound << " at iter " << iteration
         << " U and L: " << PE->UpperBound() << " " << PE->LowerBound()
         << " BP: " << bpCount << " : " << PE->LipConstant() << "\n";
    if (showVertex) {
        cout << "LowerV        UpperV     \n";
        for (int i = 0; i < numWeights; i++)
            cout << t1[i] << "      " << t2[i] << "\n";
    }
}

void BB::Bounding(PartitionElement *PE)
{
    //Keeping global bounds updated
    currentLB = PE->LowerBound();
    if (gLowerBound < currentLB)
        gLowerBound = currentLB;
    currentUB = PE->UpperBound();
    if (gUpperBound > currentUB) {
        gUpperBound = currentUB;
        real * tempS = PE->Solution();
        for (int i = 0; i < numWeights; i++) {
            gSolution[i] = tempS[i];
        }
    }
}


void BB::LoadNet(PartitionElement * PE)
{
    real * temp = PE->solution;
    net.SetWeights(temp);
    net.forward();
}


int BB::MaxEdgeIndex(real *temp1, real *temp2)
{
    int maxi = 0, i;
    real temp = temp2[0] - temp1[0];

    for (i = 1; i < numWeights; i++) {
        if ((temp2[i] - temp1[i]) > temp) {
            temp = temp2[i] - temp1[i];
            maxi = i;
        }
    }
    return maxi;
}
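
MaxEdgeIndex() selects the coordinate along which an element is split. The sketch below shows the selection together with the simple-bisection variant, in which the longest edge is cut at its midpoint; this corresponds to the commented-out "simple bisection" line in Partition(), not to the Piyavskii split point that Partition() actually uses. The function names and signatures are assumptions made for illustration, and `real` is assumed to be a floating-point typedef.

```cpp
#include <cassert>

typedef double real;   // assumption: 'real' is a floating-point typedef

// Index of the longest edge of the box [lo, hi].
int maxEdgeIndex(const real *lo, const real *hi, int n)
{
    assert(n > 0);
    int maxI = 0;
    real longest = hi[0] - lo[0];
    for (int i = 1; i < n; i++) {
        if (hi[i] - lo[i] > longest) {
            longest = hi[i] - lo[i];
            maxI = i;
        }
    }
    return maxI;
}

// Split the box at the midpoint of its longest edge. On return the two
// children are [lo, hiLeft] and [loRight, hi].
void bisect(const real *lo, const real *hi, int n,
            real *hiLeft, real *loRight)
{
    int k = maxEdgeIndex(lo, hi, n);
    for (int i = 0; i < n; i++) {
        hiLeft[i] = hi[i];     // children keep the old vertices ...
        loRight[i] = lo[i];
    }
    real mid = (lo[k] + hi[k]) / 2.0;
    hiLeft[k] = mid;           // ... except along the longest edge
    loRight[k] = mid;
}
```

Splitting only the longest edge keeps the elements from becoming long and thin, so the diagonal length, and with it the Lipschitz lower bound, improves steadily as the partition is refined.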


PartitionElement * BB::GetPE(void)
{
    /*
    if (PEcount >= arraySize) {
        PEarray = new PartitionElement [arraySize];
        if (!PEarray) {
            cout << "Error in memory assignment in BB::GetPE \n";
            exit(1);
        }
        PEcount = 0;
    }
    return &PEarray[PEcount++];
    */

    PEarray = new PartitionElement [1];
    //using PEarray = new PartitionElement causes a problem for g++
    //but not Turbo C++
    if (!PEarray) {
        cout << "Error in memory assignment in BB::GetPE \n";
        exit(1);
    }
    return PEarray;
}


void BB::Partition(PartitionElement *PE)
{
    //Partition a PE from its longest edge. This partitioning scheme
    //may be changed to more sophisticated ones.
    real *temp1 = new real[numWeights];
    if (!temp1) {
        cout << "Memory assignment error in FindUpper().\n";
        exit(1);
    }

    real *temp2 = new real[numWeights];
    if (!temp2) {
        cout << "Memory assignment error in FindUpper().\n";
        exit(1);
    }

    for (int i = 0; i < numWeights; i++) {
        temp1[i] = PE->lowerVertex[i];
        temp2[i] = PE->upperVertex[i];
    }

    int index = MaxEdgeIndex(temp1, temp2);

    PartitionElement *new1, *new2;

    //The new PEs keep the old lower and upper vertices
    new1 = GetPE();
    new2 = GetPE();

    new1->SetLowerBound(PE->LowerBound());
    new1->SetLowerVertex(temp1);
    new1->SetLowerValue(PE->LowerValue());

    new2->SetLowerBound(PE->LowerBound());
    new2->SetUpperVertex(temp2);
    new2->SetUpperValue(PE->UpperValue());

    real *temp = new real[numWeights];
    if (!temp) {
        cout << "Memory assignment error in FindUpper().\n";
        exit(1);
    }
    real tt = PE->DiaLength();
    real tt1 = PE->LipConstant();      //check lipConstant to prevent overflow
    tt1 = (tt1 > .0001) ? tt1 : .0001;
    tt1 = (PE->upperValue - PE->lowerValue)/2./tt1;

    for (i = 0; i < numWeights; i++) {
        //temp[i] = (temp1[i] + temp2[i])/2.0;  //simple bisection
        temp[i] = (temp1[i] + temp2[i])/2.0     //Piyavskii bisection
                  + tt1 * (temp2[i] - temp1[i])/tt;
    }

    //Only the longest edge is subdivided. Later may use a more elaborate
    //division scheme. The dividing point is determined by the lower-bounding
    //point found in temp
    temp1[index] = temp[index];
    temp2[index] = temp[index];

    //The first new PE has the upper vertex as the old upperVertex except
    //the longest edge is subdivided.
    new1->SetUpperVertex(temp2);
    new2->SetLowerVertex(temp1);
    new1->SetDiaLength();
    new2->SetDiaLength();
    new1->CompuLip();
    new2->CompuLip();

    new1->FindUpper(temp, findUpperMethod);
    new1->FindLower();

    new2->FindUpper(temp, findUpperMethod);
    new2->FindLower();

    UpdateList(new1);
    UpdateList(new2);

    delete(PE);
}
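
// Illustration only (not part of gota.cc): a minimal, self-contained sketch of
// the split-point rule used in BB::Partition. The names longestEdge and
// splitCoordinate, and the use of std::vector<double>, are assumptions made
// for this sketch. fLo and fHi stand for the error values stored at the lower
// and upper vertices, and lip for the local Lipschitz constant; the box is
// assumed to be non-degenerate (diagonal length > 0).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the longest edge of the box [lo, hi] (first one wins on ties).
std::size_t longestEdge(const std::vector<double>& lo,
                        const std::vector<double>& hi) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < lo.size(); ++i)
        if (hi[i] - lo[i] > hi[best] - lo[best]) best = i;
    return best;
}

// Coordinate at which edge k is cut: the midpoint, shifted along the edge
// by (fHi - fLo) / (2 L), scaled by edge length over diagonal length.
double splitCoordinate(const std::vector<double>& lo,
                       const std::vector<double>& hi,
                       std::size_t k, double fLo, double fHi, double lip) {
    double diag = 0.0;
    for (std::size_t i = 0; i < lo.size(); ++i)
        diag += (hi[i] - lo[i]) * (hi[i] - lo[i]);
    diag = std::sqrt(diag);
    const double L = (lip > 1e-4) ? lip : 1e-4;   // same floor as the listing
    const double shift = (fHi - fLo) / 2.0 / L;
    return (lo[k] + hi[k]) / 2.0 + shift * (hi[k] - lo[k]) / diag;
}
```

// When fLo equals fHi the rule reduces to simple bisection of the longest edge.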

void BB::UpdateList(PartitionElement *newPE)
{
    int index;

    if (newPE->LipConstant() < lipThresh) {
        delete (newPE);
        return;
    }
    if (newPE->didBP) {
        if (newPE->UpperBound() < .04)
            listBuffer.Add(newPE);
        else
            listBuffer.Append(newPE);
        return;
    }

    real epsilon = .001;  //the newPE gets deleted if no better solution than
                          //the current global bound exists
    if ((gUpperBound - newPE->LowerBound()) < epsilon)
        delete (newPE);
    else {
        switch (searchMethod) {
        case BEST_FIRST:    //sort by upper bound
            index = listBuffer.FindUpIndex(newPE->UpperBound());
            listBuffer.Insert(newPE, index);
            break;

        //Sort the list of PEs by the lower bounds. Branching on the PE with the
        //lowest lower bound ensures that the bound improves, hence convergence
        case LOWER_CONV:
            index = listBuffer.FindLoIndex(newPE->LowerBound());
            listBuffer.Insert(newPE, index);
            break;
        case DEPTH_FIRST:   //sort by FIFO, first in first out
            listBuffer.Add(newPE);
            break;
        case BREATH_FIRST:  //sort by FILO, first in last out
            listBuffer.Append(newPE);
            break;
        }
    }
}
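
// Illustration only (not part of gota.cc): a minimal sketch of the pruning test
// and lower-bound ordering that BB::UpdateList applies. Box, insertOrPrune and
// the std::vector container are assumptions standing in for PartitionElement
// and listBuffer.

```cpp
#include <algorithm>
#include <vector>

struct Box {
    double lower;   // lower bound of the error over the box
    double upper;   // best (upper-bound) error found in the box
};

// Discards the box when it cannot improve on the global upper bound by at
// least eps; otherwise inserts it so that 'active' stays sorted by lower bound.
// Returns true if the box was kept.
bool insertOrPrune(std::vector<Box>& active, const Box& b,
                   double globalUpper, double eps) {
    if (globalUpper - b.lower < eps)
        return false;                       // pruned
    std::vector<Box>::iterator pos = std::upper_bound(
        active.begin(), active.end(), b,
        [](const Box& x, const Box& y) { return x.lower < y.lower; });
    active.insert(pos, b);
    return true;
}
```

// Branching then takes active.front(), the box with the lowest lower bound.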


void BB::MergeList(PList *Newlist)
{
    //Merge newly created list with the current active list
    PartitionElement *previous, *temp = Newlist->list;
    previous = temp;

    for (; temp; temp = previous->next) {
        previous = temp;
        //!!! need change to deal with different sorting strategy
        int index = listBuffer.FindUpIndex(temp->UpperBound());
        listBuffer.Insert(temp, index);
    }
}

int main(int argc, char** argv)
{
    char ch;
    if (!(argc == 2)) {
        cerr << "USAGE: execFile dataFile\n";
        exit(1);
    }

    //The net class is defined at the beginning to make it accessible to
    // the BB class.
    Randomize();
    if (net.build(argv[1]) == false) {
        cerr << "Error in initializing net\n";
        exit(2);
    }

    BB bb;
    PartitionElement *currentPE;

    currentPE = bb.Initialize();
    net.showWeights();
    bb.ShowBound(currentPE);

    while (!bb.Converged()) {
        bb.Partition(currentPE);
        currentPE = bb.Branching();
        bb.Bounding(currentPE);
        if (singleStep)
            cin.get(ch);
        if (net.onWeights())
            net.showWeights();
        bb.ShowBound(currentPE);
    }
}


//**************************************************************************//
//                                                                          //
//  FNET.cc as part two of gota.cc                                          //
//                                                                          //
//**************************************************************************//

#define EXT extern

#include "gota.h"

EXT Network net;              //define a neural net instance -- net
EXT int numWeights;           //the size of the weight vector
EXT int bpCount;              //count for the number of bp() being called
EXT int showVertex;           //1 for true, show the vertices of the hyper-rectangle
EXT real lBound, uBound;      //the bounds for the initial partition element

EXT real startBpThresh;       //threshold to start BP when error is
                              //less than that
EXT real slopeThresh;         //Stop local gradient search if the norm
                              //of the gradient is less than that
EXT real lipThresh;
EXT real improveThresh;
EXT int searchMethod;
EXT int findUpperMethod;

EXT void Randomize(void);     //System indep. randomize(void);
EXT int unitType;             //sigmoid or asymmetric sigmoid or Gaussian
EXT int errorFunction;        //type of error function
EXT int fanInSplit;           //If true(1), eta is divided by # of inputs
EXT int bufSize;              //The size of the weight array
EXT int wcount;               //weight count for control weight I/O
EXT int singleStep;           //1 for true, stop at each iteration
EXT real weightLowerRange;    //weight range (lower, upper)
EXT real weightUpperRange;
EXT real sigmoidPrimeOffset;  //Add to sigmoid-prime to keep it from being 0
EXT real errorThreshold;      //Error set to zero if less than threshold
EXT real *weightBuf;          //Global variables for control weight I/O

//The following is mostly for using quickprop, adapted from quickprop1.c
EXT real modeSwitchThreshold; //Inside threshold, do normal grad descent
                              //otherwise, jump.
EXT real MaxFactor;           //Jump at most this times last step
EXT real weightDecay;         //Weight decay

//////////////////////////////////////////////////////////////////
//                                                              //
//The following are implementation of network class, including //
//its parent classes.                                           //
//                                                              //
//////////////////////////////////////////////////////////////////


Hidden::Hidden( void )
{
    numOut = numIn = 0;
    count = 0;            //initialize for weight input check
    eta = .5;
    alpha = .9;
    gamma = 1.;
    deltaThresh = 0.;
    patternDT = 0.;
    prevDT = 0.1;
    done = false;
    netA = netB = 0.;

    threshold = weightLowerRange + (weightUpperRange - weightLowerRange) *
                (real)( rand() % 1000)/1000.0;
    //This gives a random number in [0,1)
    //independent of system rand()
}

Hidden::~Hidden()
{
    if ( numIn )
    {
        //all of these were allocated with new[] in addlink
        delete [] w;
        delete [] lVertex;
        delete [] uVertex;
        delete [] backLink;
        delete [] partial;
        delete [] deltaWeight;
        delete [] patternDW;
        delete [] prevPartial;
    }
}

void Hidden::addlink( void* fromNode)
{
    //grow every per-link array by one slot
    Hidden **h = new Hidden*[numIn+1];
    real *x  = new real[numIn+1];
    real *x1 = new real[numIn+1];
    real *x2 = new real[numIn+1];
    real *x3 = new real[numIn+1];
    real *x4 = new real[numIn+1];
    real *x5 = new real[numIn+1];
    real *x6 = new real[numIn+1];

    if ( numIn )
    {
        for (int i=0;i<numIn;i++)
        {
            h[i]  = backLink[i];
            x[i]  = w[i];
            x1[i] = partial[i];
            x2[i] = deltaWeight[i];
            x3[i] = patternDW[i];
            x4[i] = prevPartial[i];
            x5[i] = lVertex[i];
            x6[i] = uVertex[i];
        }
        delete [] backLink;
        delete [] w;
        delete [] lVertex;
        delete [] uVertex;
        delete [] partial;
        delete [] deltaWeight;
        delete [] patternDW;
        delete [] prevPartial;
    }

    backLink = h;
    w = x;
    partial = x1;
    deltaWeight = x2;
    patternDW = x3;
    prevPartial = x4;
    lVertex = x5;
    uVertex = x6;

    w[ numIn ] = weightLowerRange + (weightUpperRange - weightLowerRange) *
                 (real)(rand() % 1000)/1000.0;
    deltaWeight[ numIn ] = 0.;
    patternDW[ numIn ] = 0.;
    partial[ numIn ] = 0.;
    prevPartial[ numIn ] = 0.;

    backLink[ numIn ] = (Hidden*) fromNode;
    backLink[ numIn++ ]->incNumout();
}


//put patternDT DW in bp directly 910502
void Hidden::getDeltaW( void )
{
    if (!done)
    {
        patternDT += eta * delta;   //delta = threshold partial
        delta = 0.;                 //reset for Hidden::bp delta accumulation
        for(int i=0;i<numIn;i++)
        {
            patternDW[i] += eta * partial[i];
            backLink[i]->getDeltaW();
        }
    }

    done = true;
}

void Hidden::sampleUpdate( void )
{
    if ( !updated )
    {
        deltaThresh = patternDT + alpha * deltaThresh;
        patternDT = 0.;     //reset for next epoch

        threshold += deltaThresh;
        for(int i=0;i<numIn;i++)
        {
            deltaWeight[i] = patternDW[i] + alpha * deltaWeight[i];
            w[i] += deltaWeight[i];
            patternDW[i] = 0.;
            backLink[i]->sampleUpdate();
            //recursive, stop at Input node
        }
    }

    updated = true;
}

void Hidden::insUpdate( void )
{
    if ( !updated )
    {
        deltaThresh = patternDT + alpha * deltaThresh;
        patternDT = 0.;
        //Note patternDT is not cumulative in this case. Can not use delta
        //because it is reset to zero in Hidden::bp to handle sample training

        threshold += deltaThresh;
        for(int i=0;i<numIn;i++)
        {
            deltaWeight[i] = eta * partial[i] + alpha * deltaWeight[i];
            w[i] += deltaWeight[i];
            backLink[i]->insUpdate();
            //recursive, stop at Input node
        }
    }

    updated = true;
}


void Hidden::qpUpdate( void )
{
    real tempDelta = 0., shrinkFactor;

    if ( !updated ) {
        shrinkFactor = MaxFactor / ( 1.0 + MaxFactor );

        if ( deltaThresh > modeSwitchThreshold ) {
            // Last step was signif. +ive
            if ( patternDT > 0.0 )      // Add in epsilon if +ive slope
                tempDelta += (fanInSplit ?
                              ( patternDT / numIn) : ( patternDT ));
            //patternDT is eta*delta already
            // If slope > (or close to) prev slope, take max size step.
            if ( patternDT > (shrinkFactor * prevDT) )
                tempDelta += ( MaxFactor * deltaThresh);
            else    // Use quadratic estimate
                tempDelta += patternDT /((prevDT - patternDT))  //need prevent /0.
                             * deltaThresh;
        }
        else if ( deltaThresh < - modeSwitchThreshold ) {
            // Last step was signif. negative....
            if ( patternDT < 0.0 )      // Add in epsilon if negative slope
                tempDelta += (fanInSplit ?
                              ( patternDT / numIn ) : ( patternDT ));
            //patternDT is eta*delta already
            // If slope < (or close to) prev slope, take max size step.
            if ( patternDT < (shrinkFactor * prevDT) )
                tempDelta += ( MaxFactor * deltaThresh);
            else    // Use quadratic estimate
                tempDelta += patternDT /((prevDT - patternDT))
                             * deltaThresh;
        }
        else {  // Normal gradient descent, complete with momentum
            // DidGradient++;
            tempDelta += ((fanInSplit ? ( patternDT / numIn ) : ( patternDT ))
                          + alpha * deltaThresh );
        }

        // Set delta weight, and adjust the weight itself.
        deltaThresh = tempDelta;
        threshold += deltaThresh;
        prevDT = patternDT;
        patternDT = 0.;

        for(int i=0;i<numIn;i++) {
            tempDelta = 0.;
            if ( deltaWeight[i] > modeSwitchThreshold ) {
                // Last step was signif. +ive
                if ( patternDW[i] > 0.0 )   // Add in epsilon if +ive slope
                    tempDelta += (fanInSplit ?
                                  ( patternDW[i] / numIn) : ( patternDW[i] ));
                // If slope > (or close to) prev slope, take max size step.
                if ( patternDW[i] > (shrinkFactor * prevPartial[i]) )
                    tempDelta += ( MaxFactor * deltaWeight[i]);
                else    // Use quadratic estimate
                    tempDelta += patternDW[i] /((prevPartial[i] - patternDW[i]))
                                 * deltaWeight[i];
            }
            else if ( deltaWeight[i] < - modeSwitchThreshold ) {
                // Last step was signif. negative....
                if ( patternDW[i] < 0.0 )   // Add in epsilon if negative slope
                    tempDelta += (fanInSplit ?
                                  ( patternDW[i] / numIn) : ( patternDW[i] ));
                // If slope < (or close to) prev slope, take max size step.
                if ( patternDW[i] < (shrinkFactor * prevPartial[i]) )
                    tempDelta += ( MaxFactor * deltaWeight[i]);
                else    // Use quadratic estimate
                    tempDelta += patternDW[i] /((prevPartial[i] - patternDW[i]))
                                 * deltaWeight[i];
            }
            else {  // Normal gradient descent, complete with momentum
                //DidGradient++;
                tempDelta += (fanInSplit ? ( patternDW[i] / numIn) : ( patternDW[i] ))
                             + alpha * deltaWeight[i];
            }

            // Set delta weight, and adjust the weight itself.
            deltaWeight[i] = tempDelta;
            w[i] += deltaWeight[i];
            prevPartial[i] = patternDW[i];
            patternDW[i] = 0.;
            backLink[i]->qpUpdate();
        }
    }

    updated = true;
}


/* The old bp still used in net.cc
void Output::bp( real target )
{
    updated = false;
    delta = (target - out) * ActPrime(out);
    patternDT += eta * delta;
    for(int i=0;i<numIn;i++)
    {
        partial[i] = delta * backLink[i]->getout();
        patternDW[i] += eta * partial[i];
        //partial derivative
        backLink[i]->bp( w[i], delta );
        //back-prop to Hidden node
    }
}
*/

void Output::bp( real Error )
{
    updated = false;
    delta = Error * ActPrime(out);
    patternDT += eta * delta;
    for(int i=0;i<numIn;i++)
    {
        partial[i] = delta * backLink[i]->getout();
        patternDW[i] += eta * partial[i];
        //partial derivative
        backLink[i]->bp( w[i], delta );
        //back-prop to Hidden node
    }
}


void Hidden::bp( real weight, real upDelta )
{
    updated = false;
    ++count;
    delta += upDelta * weight;

    if ( count == numOut )
    //back-prop only when it gets all info from its ancestors
    {
        delta *= ActPrime(out);
        patternDT += eta * delta;
        for(int i=0;i<numIn;i++)
        {
            partial[i] = delta * backLink[i]->getout();
            patternDW[i] += eta * partial[i];
            backLink[i]->bp( w[i], delta );
        }
        count = 0;
        delta = 0.;     //reset for next pattern in sample training
    }
}


real Hidden::output()
{
    delta = in = 0.;
    for(int i=0;i<numIn;i++)
        in += w[i] * backLink[i]->output();
    in += threshold;

    return out = Activation(in);
}

real Hidden::GetGrad()
{
    ++count;
    if ( count == numOut ) {
        in = 0.;
        for(int i=0;i<numIn;i++) {
            in += (patternDW[i])*(patternDW[i]);
            in += backLink[i]->GetGrad();
        }
        count = 0;
        in += (patternDT)*(patternDT);
        return in;
    }

    return 0.;
}

real Output::GetGrad()
{
    in = 0.;
    for(int i=0;i<numIn;i++) {
        in += (patternDW[i])*(patternDW[i]);
        in += backLink[i]->GetGrad();
    }
    in += (patternDT)*(patternDT);

    return in;
}

real Hidden::Activation(real netIn)
{
    switch (unitType) {
    case SIGMOID:
        if (netIn < -15.)
            return(out = -.5);
        else if (netIn > 15.)
            return(out = .5);
        else
            return out = 1./( 1. + exp( -gamma * netIn ) ) - .5;
    case ASYMSIGMOID:
        if (netIn < -15.)
            return(out = .0);
        else if (netIn > 15.)
            return(out = 1.0);
        else
            return out = 1./( 1. + exp( -gamma * netIn ) );
    case GAUSSIAN:
        //to be defined
        break;
    }
    return out = 0.;    //no activation is defined for this unit type
}
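
// Illustration only (not part of FNET.cc): a minimal sketch of the two sigmoid
// variants used by Hidden::Activation, together with derivatives written in
// terms of the unit's output, which is the form Hidden::ActPrime uses. The
// function names are assumptions. The derivative expressions are exact only
// for gain 1; 'offset' plays the role of sigmoidPrimeOffset.

```cpp
#include <cmath>

// Symmetric sigmoid with range (-0.5, 0.5), saturated outside |net| > 15.
double symSigmoid(double net, double gain) {
    if (net < -15.0) return -0.5;
    if (net > 15.0) return 0.5;
    return 1.0 / (1.0 + std::exp(-gain * net)) - 0.5;
}

// Asymmetric (logistic) sigmoid with range (0, 1).
double asymSigmoid(double net, double gain) {
    if (net < -15.0) return 0.0;
    if (net > 15.0) return 1.0;
    return 1.0 / (1.0 + std::exp(-gain * net));
}

// Derivatives expressed through the output value 'out'. The offset keeps the
// derivative away from zero when the unit saturates.
double symSigmoidPrime(double out, double offset) {
    return offset + (0.25 - out * out);
}

double asymSigmoidPrime(double out, double offset) {
    return offset + out * (1.0 - out);
}
```

// Both derivatives peak at 0.25 (plus the offset) when the net input is zero.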


real Hidden::ActPrime(real Value)
{
    switch (unitType) {
    case SIGMOID:       //symmetrical sigmoid
        return (sigmoidPrimeOffset + (.25 - Value * Value));
    case ASYMSIGMOID:   //asymmetrical sigmoid
        return (sigmoidPrimeOffset + Value * (1. - Value));
    case GAUSSIAN:
        //to be defined
        return 0.;
    }
}

void Hidden::setWeights()
{
    ++count;
    if (count == numOut)
    {
        threshold = weightBuf[ wcount++ ];
        for (int i = 0; i < numIn; i++)
        {
            w[i] = weightBuf[ wcount++ ];
            backLink[i]->setWeights();
        }
        count = 0;
    }
}

void Hidden::setVertex(int type)
{
    ++count;
    int i;

    if (count == numOut) {
        switch (type) {
        case 1:
            tLV = weightBuf[ wcount++ ];
            for ( i = 0; i < numIn; i++) {
                lVertex[i] = weightBuf[ wcount++ ];
                backLink[i]->setVertex(type);
            }
            count = 0;
            break;
        case 2:
            tUV = weightBuf[ wcount++ ];
            for ( i = 0; i < numIn; i++) {
                uVertex[i] = weightBuf[ wcount++ ];
                backLink[i]->setVertex(type);
            }
            count = 0;
            break;
        }
    }
}

void Hidden::putWeights()
{
    ++count;
    if (count == numOut)
    {
        weightBuf[ wcount++ ] = threshold;
        for (int i = 0; i < numIn; i++)
        {
            weightBuf[ wcount++ ] = w[i];
            backLink[i]->putWeights();
        }
        count = 0;
    }
}


void Hidden::displayWeights()
{
    ++count;
    if (count == numOut)
    {
        cout << "threshold:  " << threshold << "\n";
        for (int i = 0; i < numIn; i++)
        {
            cout << "weight:     " << w[i] << "\n";
            backLink[i]->displayWeights();
        }
        count = 0;
    }
}


real Hidden::findA()
{
    //!!!note this works only for single h-layer network, since findA() from
    //the input node is a constant w.r.t. the weights
    netA = 0.;
    for (int i = 0; i < numIn; i++) {   //need reset to 0 after use
        netA += lVertex[i] * Activation(backLink[i]->findA());
    }
    netA += tLV;

    return netA;
}

real Hidden::findB()
{
    //!!!note this works only for single h-layer network
    netB = 0.;
    for (int i = 0; i < numIn; i++) {   //need reset to 0 after use
        netB += uVertex[i] * Activation(backLink[i]->findB());
    }
    netB += tUV;

    return netB;
}


void Output::findRange()
{
    netA = netB = 0.;
    for (int i = 0; i < numIn; i++) {
        if (lVertex[i] > 0)
            netA += lVertex[i] * Activation(backLink[i]->findA());
        else
            netA += lVertex[i] * Activation(backLink[i]->findB());
        if (uVertex[i] > 0)
            netB += uVertex[i] * Activation(backLink[i]->findB());
        else
            netB += uVertex[i] * Activation(backLink[i]->findA());
    }
    netA += tLV;
    netB += tUV;
}


void Output::setWeights()
{
    threshold = weightBuf[ wcount++ ];
    for (int i = 0; i < numIn; i++)
    {
        w[i] = weightBuf[ wcount++ ];
        backLink[i]->setWeights();
    }
}

void Output::setVertex(int type)
{
    int i;

    switch (type) {
    case 1:
        tLV = weightBuf[ wcount++ ];
        for ( i = 0; i < numIn; i++) {
            lVertex[i] = weightBuf[ wcount++ ];
            backLink[i]->setVertex(type);
        }
        break;
    case 2:
        tUV = weightBuf[ wcount++ ];
        for ( i = 0; i < numIn; i++) {
            uVertex[i] = weightBuf[ wcount++ ];
            backLink[i]->setVertex(type);
        }
        break;
    }
}


void Output::putWeights()
{
    weightBuf[ wcount++ ] = threshold;
    for (int i = 0; i < numIn; i++)
    {
        weightBuf[ wcount++ ] = w[i];
        backLink[i]->putWeights();
    }
}


void Output::displayWeights()
{
    cout << "threshold:  " << threshold << "\n";
    for (int i = 0; i < numIn; i++)
    {
        cout << "weight:     " << w[i] << "\n";
        backLink[i]->displayWeights();
    }
}


bool Pattern::getMem(int inSize, int outSize)
{
    in = new real[inSize];
    if (!in)
    {
        cerr << "Error during memory allocation of Pattern, in[] !\n";
        return(false);
    }

    out = new real[outSize];
    if (!out)
    {
        cerr << "Error during memory allocation of Pattern, out[] !\n";
        return(false);
    }

    return true;
}


Network::Network(void)
{
    nIn = nOut = nHidden = 0;
    totalError = 10000.;
    iter = 0;
}


bool Network::build(char* inputFile)
{
    int i, j, fromNode, fromEnd, toNode, toEnd;
    char check, junk[1000], inString[20];

#ifdef TURBO_CPP                        //redefine cin for using redirect
    ifstream cin(inputFile, ios::in);   //operator < in command line. But
    if (!cin)                           //this can not be used with interactive
    {                                   //running of the program.
        cerr << "Error opening file data file \n";
        return(false);
    }
#else
    istream cin(inputFile, io_readonly, a_use);
    if (!cin)
    {
        cerr << "Error opening file data file \n";
        return(false);
    }
#endif

    //skip leading comment lines (starting with '#') and blank lines
    cin.get(check);
    cin.putback(check);
    while (check == '#' || check == '\n') {
        cin.getline(junk, 90, '\n');
        cin.get(check);
        cin.putback(check);
    }

    cin >> inString >> searchMethod;
    cin >> inString >> findUpperMethod;
    cin >> inString >> startBpThresh;
    cin >> inString >> lipThresh;
    cin >> inString >> slopeThresh;
    cin >> inString >> improveThresh;

    cin >> inString >> learningRate;
    cin >> inString >> momemtum;
    cin >> inString >> gainFactor;
    cin >> inString >> stopError;
    cin >> inString >> errorThreshold;
    cin >> inString >> lBound >> uBound;
    cin >> inString >> maxIteration;
    cin >> inString >> singleStep;
    cin >> inString >> showVertex;
    cin >> inString >> errorFunction;
    cin >> inString >> fanInSplit;
    cin >> inString;
    cin >> sigmoidPrimeOffset;

    cin >> inString >> unitType;
    cin >> inString >> nIn;
    cin >> inString >> nHidden;
    cin >> inString >> nOut;
    cin >> inString;

    Input tempI;
    inNode = new Input[nIn];
    if (!inNode)
    {
        cerr << "Error during memory allocation for Input node\n";
        return(false);
    }
    for (i = 0; i < nIn; i++)
        inNode[i] = tempI;

    Hidden tempH;
    hNode = new Hidden[nHidden];
    if (!hNode)
    {
        cerr << "Error during memory allocation for Hidden node\n";
        return(false);
    }
    for (i = 0; i < nHidden; i++)
    {
        hNode[i] = tempH;
        hNode[i].setParameter(learningRate, momemtum, gainFactor);
    }

    Output tempO;
    outNode = new Output[nOut];
    if (!outNode)
    {
        cerr << "Error during memory allocation for outNode\n";
        return(false);
    }
    for (i = 0; i < nOut; i++)
    {
        outNode[i] = tempO;
        outNode[i].setParameter(learningRate, momemtum, gainFactor);
    }


    cin >> inString;
    while (strcmp(inString, "end") != 0)
    {
        cin >> fromNode >> fromEnd >> toNode >> toEnd;
        //set link from node(i-j) to node(k-l)
        if (strcmp(inString, "in->out") == 0)
        {
            for ( i = fromNode; i <= fromEnd; i++)
            {
                for ( j = toNode; j <= toEnd; j++)
                {
                    outNode[j] += inNode[i];
                    bufSize++;
                }
            }
        }

        if (strcmp(inString, "in->hid") == 0)
        {
            for ( i = fromNode; i <= fromEnd; i++)
            {
                for ( j = toNode; j <= toEnd; j++)
                {
                    hNode[j] += inNode[i];
                    bufSize++;
                }
            }
        }

        if (strcmp(inString, "hid->out") == 0)
        {
            for ( i = fromNode; i <= fromEnd; i++)
            {
                for ( j = toNode; j <= toEnd; j++)
                {
                    outNode[j] += hNode[i];
                    bufSize++;
                }
            }
        }

        if (strcmp(inString, "hid->hid") == 0)
        {
            for ( i = fromNode; i <= fromEnd; i++)
            {
                for ( j = toNode; j <= toEnd; j++)
                {
                    hNode[j] += hNode[i];
                    bufSize++;
                }
            }
        }

        cin >> inString;
    }

    //one threshold per hidden and output node, in addition to the link weights
    bufSize += nHidden;
    bufSize += nOut;

    weightBuf = new real[bufSize];
    if (!weightBuf)
    {
        cerr << "Error during memory allocation for weightBuf\n";
        return(false);
    }

    cin >> inString >> trainingMethod;  //0,1,2 for sample, random instance
    cin >> inString >> weightsIn;       //and sequential instance training
    cin >> inString >> inWFile;         //3 for quickprop
    cin >> inString >> weightsOut;
    cin >> inString >> outWFile;
    cin >> inString >> weightsOn;
    cin >> inString >> nPattern;
    cin >> inString >> inWidth;
    cin >> inString >> inDepth;
    cin >> inString >> outWidth;
    cin >> inString >> outDepth;

    patternError = new real[nPattern];
    if (!patternError)
    {
        cout << "Error in memory for patternError \n";
        exit(1);
    }

    pattern = new Pattern[nPattern];
    if (!pattern)
    {
        cerr << "Error during memory allocation for pattern\n";
        return(false);
    }

    cin >> inString;

    for (i = 0; i < nPattern; i++)
        if (!pattern[i].getMem(inWidth * inDepth, outWidth * outDepth))
            return(false);

    for (int atPattern = 0; atPattern < nPattern; atPattern++)
    {
        for ( i = 0; i < inWidth * inDepth; i++)
            cin >> pattern[atPattern].in[i];
        for ( i = 0; i < outWidth * outDepth; i++)
            cin >> pattern[atPattern].out[i];
    }

    cin >> inString;
    if ( strcmp(inString, "endData") != 0) {
        cerr << "Input file data error !! \n";
        return (false);
    }

    ranSequence = new int[nPattern];
    if (!ranSequence)
    {
        cerr << "Error during memory allocation for random sequence\n";
        return(false);
    }
    for ( i = 0; i < nPattern; i++)
        ranSequence[i] = i;

    return true;
}



void Network::randSample(int rFactor)       //randomize the pattern sequence
{                                           //the level of shuffling can be
    int ran1, ran2, temp;                   //controlled by rFactor ranging from
    int range = (int) nPattern/rFactor;     //2 to nPattern
    for ( int i = 0; i < range; i++ )
    {
#ifdef GNU_CPP
        ran1 = random() % nPattern;
        ran2 = random() % nPattern;
#else
        ran1 = random( nPattern );
        ran2 = random( nPattern );
#endif
        if ( ran1 != ran2)
        {
            temp = ranSequence[ran1];
            ranSequence[ran1] = ranSequence[ran2];
            ranSequence[ran2] = temp;
        }
    }
}


void Network::forward()
{
    totalError = 0.;
    for (int atPattern = 0; atPattern < nPattern; atPattern++ )
    {
        patternError[atPattern] = 0.;
        for (int atInput = 0; atInput < nIn; atInput++ )
            inNode[atInput].input(pattern[atPattern].in[atInput]);

        for (int atOut = 0; atOut < nOut; atOut++)
        {
            real temp = outNode[atOut].output();
            real outputError = OutputError(pattern[atPattern].out[atOut], temp);
            patternError[atPattern] += outputError * outputError;
        }
        totalError += patternError[atPattern];
    }
}


real Network::OutputError(real target, real outcome)
{
    real diff = target - outcome;

    switch (errorFunction) {
    case SumOfSquare:
        if ( fabs(diff) < errorThreshold)
            return (0.);
        else
            return (diff);
    case HyperError:    //Using atanh for error
        if (diff < -.999999)
            return (-17.0);
        else if (diff > .999999)
            return (17.0);
        else
            return (log ((1.0 + diff)/(1.0 - diff)));
    }
}
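
// Illustration only (not part of FNET.cc): a minimal sketch of the two error
// measures in Network::OutputError as free functions. The names
// thresholdedError and hyperError are assumptions. The logarithm in the
// HyperError branch equals twice the inverse hyperbolic tangent of the
// difference, which the test below checks against std::atanh.

```cpp
#include <cmath>

// Plain difference, set to zero when it is smaller than the threshold.
double thresholdedError(double target, double outcome, double threshold) {
    const double diff = target - outcome;
    return (std::fabs(diff) < threshold) ? 0.0 : diff;
}

// log((1 + d) / (1 - d)) = 2 * atanh(d), clipped to +/-17 near |d| = 1
// so that the logarithm never sees a zero or negative argument.
double hyperError(double target, double outcome) {
    const double diff = target - outcome;
    if (diff < -0.999999) return -17.0;
    if (diff > 0.999999) return 17.0;
    return std::log((1.0 + diff) / (1.0 - diff));
}
```

// Compared with the plain difference, this measure grows without bound as the
// output approaches the wrong extreme, so badly wrong units get larger updates.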


void Network::backProp()
{
    totalError = 0.;

    switch (trainingMethod) {
    case SAMPLE:        //sample or Epoch training
        for (int atPattern = 0; atPattern < nPattern; atPattern++ ) {
            patternError[atPattern] = 0.;
            for (int atInput = 0; atInput < nIn; atInput++ )
                inNode[atInput].input(pattern[atPattern].in[atInput]);

            for (int atOut = 0; atOut < nOut; atOut++) {
                real temp = outNode[atOut].output();
                real outputError = OutputError(pattern[atPattern].out[atOut], temp);
                if (outputError != 0.) {
                    patternError[atPattern] += outputError * outputError;
                    outNode[atOut].bp( outputError );
                }
            }
        }

        for ( int atOut = 0; atOut < nOut; atOut++)
            outNode[atOut].sampleUpdate();
        break;

    case RANINSTANCE:   //randomized instance or pattern training
        randSample(2);
        for (int i = 0; i < nPattern; i++ ) {
            patternError[ ranSequence[i] ] = 0.;
            for (int atInput = 0; atInput < nIn; atInput++ )
                inNode[atInput].input(pattern[ ranSequence[i] ].in[atInput] );

            for (int atOut = 0; atOut < nOut; atOut++) {
                real temp = outNode[atOut].output();
                real outputError = OutputError(pattern[ranSequence[i]].out[atOut], temp);
                if (outputError != 0.) {
                    patternError[ranSequence[i]] += outputError * outputError;
                    outNode[atOut].bp( outputError );
                }
            }

            for (atOut = 0; atOut < nOut; atOut++)
                outNode[atOut].insUpdate();
        }
        break;

    case INSTANCE:      //sequential instance or pattern training
        for (atPattern = 0; atPattern < nPattern; atPattern++ ) {
            patternError[atPattern] = 0.;
            for (int atInput = 0; atInput < nIn; atInput++ )
                inNode[atInput].input(pattern[atPattern].in[atInput]);

            for (int atOut = 0; atOut < nOut; atOut++) {
                real temp = outNode[atOut].output();
                real outputError = OutputError(pattern[atPattern].out[atOut], temp);
                if (outputError != 0.) {
                    patternError[atPattern] += outputError * outputError;
                    outNode[atOut].bp( outputError );
                }
            }

            for (atOut = 0; atOut < nOut; atOut++)
                outNode[atOut].insUpdate();
        }
        break;

    case QUICKPROP:     //quickpropagation training
        for (atPattern = 0; atPattern < nPattern; atPattern++ ) {
            patternError[atPattern] = 0.;
            for (int atInput = 0; atInput < nIn; atInput++ )
                inNode[atInput].input(pattern[atPattern].in[atInput]);

            for (int atOut = 0; atOut < nOut; atOut++) {
                real temp = outNode[atOut].output();
                real outputError = OutputError(pattern[atPattern].out[atOut], temp);
                if (outputError != 0.) {
                    patternError[atPattern] += outputError * outputError;
                    outNode[atOut].bp( outputError );
                }
            }
        }
        for (atOut = 0; atOut < nOut; atOut++)
            outNode[atOut].qpUpdate();
        break;

    default:
        //cout <<" Wrong training method entered, use default INSTANCE training! \n";
        for (atPattern = 0; atPattern < nPattern; atPattern++ ) {
            patternError[atPattern] = 0.;
            for (int atInput = 0; atInput < nIn; atInput++ )
                inNode[atInput].input(pattern[atPattern].in[atInput]);

            for (int atOut = 0; atOut < nOut; atOut++) {
                real temp = outNode[atOut].output();
                real outputError = OutputError(pattern[atPattern].out[atOut], temp);
                if (outputError != 0.) {
                    patternError[atPattern] += outputError * outputError;
                    outNode[atOut].bp( outputError );
                }
            }

            for (atOut = 0; atOut < nOut; atOut++)
                outNode[atOut].insUpdate();
        }
        break;
    }

    for (int atPattern = 0; atPattern < nPattern; atPattern++ )
        totalError += patternError[atPattern];
}


bool Network::trained(void)
{
    if ((totalError < stopError) || (iter >= maxIteration))
        return true;

    return false;
}


void Network::FindSigRange(void)
{
    for (int atOut = 0; atOut < nOut; atOut++) {
        outNode[atOut].findRange();
    }
}


real Network::findGrad(void)
{
    real temp, outputError;

    for (int atPattern = 0; atPattern < nPattern; atPattern++ )
    {
        for (int atInput = 0; atInput < nIn; atInput++ )
            inNode[atInput].input(pattern[atPattern].in[atInput]);

        for (int atOut = 0; atOut < nOut; atOut++)
        {
            temp = outNode[atOut].output();
            outputError = OutputError(pattern[atPattern].out[atOut], temp);
            outNode[atOut].bp( outputError );
        }
    }

    temp = 0.;
    for (int atOut = 0; atOut < nOut; atOut++) {
        temp += outNode[atOut].GetGrad();
    }

    return temp = sqrt(temp);
}

real Network::LipConst(void)
{
    real lipc = 0.;

    for (int atPattern = 0; atPattern < nPattern; atPattern++) {
        for (int atInput = 0; atInput < nIn; atInput++)
            inNode[atInput].input(pattern[atPattern].in[atInput]);

        FindSigRange();

        real temp = 0., temp1, temp2, temp3, temp4, sum3 = 0., sum4 = 0.;
        for (atInput = 0; atInput < nIn; atInput++)
            temp += pattern[atPattern].in[atInput] * pattern[atPattern].in[atInput];
        temp += 1.;
        temp = sqrt(temp);   //(1 + sum x^2)^(1/2)

        //gets |y-o|
        if (pattern[atPattern].out[0] > .5)   //works for one output net only
            temp1 = pattern[atPattern].out[0] -
                    outNode[0].Activation(outNode[0].getA());
        else
            temp1 = outNode[0].Activation(outNode[0].getB())
                    - pattern[atPattern].out[0];

        //get f' for the output node
        temp2 = outNode[0].getA();
        if (temp2 > 0.) {
            temp2 = outNode[0].Activation(temp2);
            temp2 = temp2 * (1 - temp2);
        }
        else if ((temp2 = outNode[0].getB()) < 0.) {
            temp2 = outNode[0].Activation(temp2);
            temp2 = temp2 * (1 - temp2);
        }
        else
            temp2 = .25;

        for (int atHidden = 0; atHidden < nHidden; atHidden++) {
            temp3 = hNode[atHidden].getB();
            temp3 = hNode[0].Activation(temp3);
            sum3 += temp3;

            //get f' for the hidden node
            temp4 = hNode[atHidden].getA();
            if (temp4 > 0.) {
                temp4 = hNode[0].Activation(temp4);
                temp4 = temp4 * (1 - temp4);
            }
            else if ((temp4 = hNode[atHidden].getB()) < 0.) {
                temp4 = hNode[0].Activation(temp4);
                temp4 = temp4 * (1 - temp4);
            }
            else
                temp4 = .25;

            sum4 += temp4;
        }

        sum3 += 1.;
        sum3 = sqrt(sum3);

        lipc += temp * temp1 * temp2 * sum3 * sum4;
    }

    lipc = gainFactor * gainFactor * lipc;
    return lipc;
}
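
The slope bound used twice in LipConst() can be isolated as a small helper. The sketch below is illustrative and not part of the original listing: it assumes a plain logistic activation s(x) = 1/(1 + exp(-x)), and the names logistic and maxSigmoidSlope are hypothetical. It bounds s(x)(1 - s(x)) over an interval [a, b] of net inputs, using the fact that this product peaks at x = 0 with value 0.25 and decreases on either side.

```cpp
#include <cmath>

// Logistic sigmoid; assumed form of the activation function.
double logistic(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Largest value of s(x) * (1 - s(x)) for x in [a, b], with a <= b.
// The maximum sits at the endpoint nearest zero, or equals 0.25
// when the interval contains zero.
double maxSigmoidSlope(double a, double b) {
    double x;
    if (a > 0.0)
        x = a;          // interval lies entirely to the right of zero
    else if (b < 0.0)
        x = b;          // interval lies entirely to the left of zero
    else
        return 0.25;    // interval contains zero
    double s = logistic(x);
    return s * (1.0 - s);
}
```

For example, maxSigmoidSlope(-1.0, 1.0) returns 0.25, while maxSigmoidSlope(1.0, 2.0) evaluates the product at x = 1. A tighter interval therefore gives a smaller bound, which is what makes local Lipschitz constants useful for pruning.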


void Network::SetVertices(real *lv, real *uv)
{
    int lower = 1, upper = 2;

    SetWeights(lv, lower);
    SetWeights(uv, upper);
}


void Network::SetWeights(real* Weights)
{
    //To keep the Network class intact, we reset weightBuf.
    for (int i = 0; i < bufSize; i++)
        weightBuf[i] = Weights[i];

    //Calls for set weights through backward recursion.
    for (int atOut = 0; atOut < nOut; atOut++)
        outNode[atOut].setWeights();
    wcount = 0;
}

void Network::SetWeights(real* Weights, int type)
{
    for (int i = 0; i < bufSize; i++)
        weightBuf[i] = Weights[i];

    for (int atOut = 0; atOut < nOut; atOut++)
        outNode[atOut].setVertex(type);
    wcount = 0;
}

void Network::displayError()
{
    totalError = 0.;

#ifdef DISPLAY_YES
    textcolor(WHITE);

    for (int atPattern = 0; atPattern < nPattern; atPattern++)
    {
        cprintf("Pattern [%d]  Error:   %f \n\r", atPattern,
                patternError[atPattern]);
        totalError += patternError[atPattern];
    }

    cout << "\n\n";

    textcolor(GREEN);
    cprintf("Iter %d TotalError: %f \n\r", ++iter, totalError);

    return;
#endif

    for (int atPattern = 0; atPattern < nPattern; atPattern++)
    {
        //cout << "Pattern [" << atPattern << "] Error: " <<
        //    patternError[atPattern] << "\n";
        totalError += patternError[atPattern];
    }

    //cout << "\n";
    cout << "Iter " << ++iter << " Total Error: " << totalError << "\n";
}

void Network::showWeights()
{
    cout << "== Threshold and Weights ==\n";

    for (int atOut = 0; atOut < nOut; atOut++)
        outNode[atOut].displayWeights();
}

///////////////////////////////////////////////
//                                           //
//  Input data file for parity three problem //
//                                           //
///////////////////////////////////////////////

#This is the input file. Most of the parameters are self-suggestive.           #
#There are six parts of the input file. The first part defines the branch     #
#and bound search strategies, local search methods, and thresholds. The        #
#second part gives neural network parameter values. The third part determines #
#the network topology. The fourth part controls training method and weight    #
#I/O options. The fifth part defines the input data set and the last is the   #
#training data.                                                                #


searchMethod: 

1 

findUBMethod: 

2 

startBpThresh: 

2. 

lipschitzThresh: 

0. 

slopeThresh: 

0.00001 

improveThresh: 

0.00001 

learningRate:

0.5 

momentum: 

0.9 

gainFactor : 

1. 

stopError : 

0.04 

errorThreshold : 

0.1 

weightRange: 

-10.  10 

maxIteration:

1000 

singleStep: 

0 

showVertex: 

0 

errorFunction: 

0 

fanInSplit:

0 

sigPrimeOffset:

0.1 

unitType:  1 

nInput:    3

nHidden:   3

nOut:      1

network: 

in->hid  0  2  0  2 

hid->out  0  2  0  0 

end 

trainingMethod  0 

weightsIn?     0

inWFile:       netl.w 

weightsOut?    0 

outWFile:      ne 

itO.w 

weightsOn?     0 
nPattern:  8

inWidth:  3 

inDepth:  1 

outWidth:  1 

outDepth:  1 

beginData: 

0.  0.  0. 

0. 

0.  0.  1. 

1. 

0.  1.  0. 

1.

0.  1.  1. 
0. 

1.  0.  0. 

1. 

1.  0.  1. 

0. 

1.  1.  0. 

0. 

1.  1.  1. 

1. 

endData 
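
The eight patterns above follow the parity rule: the target is 1 when an odd number of inputs are 1, and 0 otherwise. As an illustration only (the struct ParityPattern and the function makeParityPatterns are hypothetical names, not part of the simulation system), the same data set could be generated in C++ as follows, in the same order as the file.

```cpp
#include <vector>

// One training pattern: nBits binary inputs and a single parity target.
struct ParityPattern {
    std::vector<double> in;
    double out;
};

// Enumerate all 2^nBits input combinations, most significant bit first,
// and set the target to 1 when the number of ones is odd.
std::vector<ParityPattern> makeParityPatterns(int nBits) {
    std::vector<ParityPattern> patterns;
    for (int code = 0; code < (1 << nBits); ++code) {
        ParityPattern p;
        int ones = 0;
        for (int bit = nBits - 1; bit >= 0; --bit) {
            int v = (code >> bit) & 1;
            p.in.push_back(v);
            ones += v;
        }
        p.out = ones % 2;
        patterns.push_back(p);
    }
    return patterns;
}
```

Calling makeParityPatterns(3) yields the eight rows listed between beginData and endData.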

APPENDIX  B
Classes  for  Neural  Network  Simulation  Systems 


//  // 

//  Header  file  for  nnet.cc  // 

//  // 

//  Neural  Network  Simulation  System  // 

//  by  // 

//  // 

//        Zaiyong  Tang  // 

//         351  Business  Building  // 

//        Dept.  of  Decision  and  Information  Sciences  // 

//        College  of  Business  Administration  // 

//        University  of  Florida  // 

//        Gainesville,  Florida  32611  // 

//  // 

//         Phone:  904/392-9600  (0)  // 

//  904/334-5430  (H)  // 

//  //

//        Internet:   ZT@cis.ufl.edu  //

//                    Tang@math.ufl.edu  //

//  // 

//  // 

//  NNET consists of a set of general classes that can be used to build  //
//  specific neural net paradigms. Basic classes include Link, Node,  //
//  Structure, etc. A generic neural net class, NeuralNet, is built upon  //
//  the basic classes. Other specific neural net classes are derived  //
//  from NeuralNet. Some major neural net sub-classes, such as the FFnet  //
//  (feed-forward neural network), can be used as parent classes from which  //
//  more algorithm-based neural net classes inherit the structure and/or  //
//  methods (functions).  //
//  //
//  A separate class Interface is designed to provide run time control and  //
//  access to the neural net parameters. Interface should be a friend to all  //
//  the neural net classes that may be used to implement a particular neural  //
//  net paradigm and/or learning algorithms.  //

//  // 

//  The  design  of  this  neural  net  simulation  system  has  benefited  from  // 

//  Larry O'Brien's NEURAL.CPP, a C++ three-layer feed-forward backprop  //
//  simulation program, (C) 1990, Miller Freeman Publications and from  //
//  R. Scott Crowder's cascor1.c, a port to C from the original Common Lisp  //
//  implementation written by Scott E. Fahlman. (Version dated Jan-25-91)  //
//  Professor Gary Koehler, my Ph.D. advisor, gave me a timely and effective  //
//  kickstart on neural net programming in C++. His advice and illustrative  //
//  backpropagation program with linked neural node structure have been  //
//  a great help to this project.  //
//  //
//  Class and function names are in small letters starting with a capital  //
//  letter. CONSTANT names are all capital and variable names start with a  //
//  small letter. Function arguments generally use the same name as variables  //
//  but start with a capital letter. Names are made self-suggestive by using  //
//  mnemonic words and word combinations.  //

//  // 


//  Change  log:  // 

//  // 

//  8/27/91  version  1.0  // 

//  // 

#ifndef NNET_H
#define NNET_H

//===================== 

//  external  definitions 
//===================== 

#include  <stream.h> 

#include  <stdlib.h> 

#include  <stdio.h> 

#include  <float.h> 

#include  <math.h> 

#include  <string.h>

#include  <time.h> 

#include  <signal.h> 

//===================== 

//  general  symbols 
//===================== 

enum BOOL {
    FALSE,
    TRUE
};

#define  real  float  //may  change  to  double  if  needed 

#define  OUTLAYER  (numLayer  -  1)  //for  FNN  only 

#define  INLAYER  0 

#define  qUIT  -99  //abort  the  program 

#define  LINELEN  80  //line length used in string handling

#define  EOL  '\0'  //end of line (string terminator)

//=====================

//  switches  used  in  the  interface  routines
//===================== 

enum ENUM {              //used as index in parameter table
                         //for interactive parameter change
    INT,                 //parameter type
    REAL,                //parameter type
    ENUME,               //parameter type
    BOOLE,               //parameter type
    INT_NO,              // parameters only good in netfile
    FLOAT_NO,            // most are used in memory allocation
    ENUME_NO,            // and cannot be changed mid-simulation
    BOOLE_NO,

    GETTRAINING,         //get training pattern from net file
    GETTEST,             //get test pattern from net file
    GETTRAININGFILE,     //get training pattern from separate training file
    GETTESTFILE,         //get test pattern from separate training file

    //special control
    VALUE,               //list value(s) of some or all parameters
    SAVE,                //interactive control, save now?
    GO,                  //continues
    QUIT,                //exit the program
    INITFILE             //initiate files for output
};


//  switch  constants 
//================== 


enum NODE_TYPE {         //node output = activationFun(net_input)
    SIGMOID,             //sigmoid activation function in (-.5, .5)
    GAUSSIAN,            //Gaussian activation function in (0,1]
    LINEAR,              //linear activation function f(x) = x (unbounded)
    ASYMSIGMOID,         //sigmoid activation function in (0, 1)
    VARSIGMOID           //sigmoid activation function in (min, max)
};
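
The header gives only the output range of each node type, not the formula. The sketch below shows one set of functions with those ranges. The formulas are assumptions chosen to match the stated ranges, and the names NodeKind, logistic, and activate are hypothetical and do not appear in nnet.cc.

```cpp
#include <cmath>

// Mirrors the NODE_TYPE enumerators; only the output ranges are taken
// from the header comments, the formulas themselves are assumed.
enum class NodeKind { Sigmoid, Gaussian, Linear, AsymSigmoid, VarSigmoid };

// Standard logistic function with range (0, 1).
double logistic(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// lo and hi are used only by VarSigmoid, whose range is (lo, hi).
double activate(NodeKind kind, double netInput,
                double lo = 0.0, double hi = 1.0) {
    switch (kind) {
    case NodeKind::Sigmoid:     return logistic(netInput) - 0.5;        // (-.5, .5)
    case NodeKind::Gaussian:    return std::exp(-netInput * netInput);  // (0, 1]
    case NodeKind::Linear:      return netInput;                        // unbounded
    case NodeKind::AsymSigmoid: return logistic(netInput);              // (0, 1)
    case NodeKind::VarSigmoid:  return lo + (hi - lo) * logistic(netInput);
    }
    return netInput;  // not reached for valid kinds
}
```

At a net input of zero the symmetric sigmoid returns 0, the asymmetric sigmoid returns 0.5, and the Gaussian returns its maximum of 1.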


enum TRANS_TYPE {        //we use the convention that net_input(x) = transFun(x)
                         //transFun: R^n -> R
    LINEARLINK,          //linear coefficient: net_input(x) = <w,x>
    BINARYLINK,          //binary coefficient: net_input(x) = <w,x>
    SECONDORDER,         //net_input(x) = polynomial of x^2
    THIRDORDER,          //net_input(x) = polynomial of x^3
    OTHER                //to be defined
};


enum trainingStatus {
    SUCCESS = 10,        //training exit with error =< allowedError
    FAILURE,             //training exit with error > allowedError
    STAGNANT,
    TIMEOUT
};


enum weightUpdating {
    EPOCH = 20,          //epoch training, or batch, sample training
    PATTERN,             //pattern training, or interactive, instance training
    SUBPATTERN           //for multiple output network, this training scheme
                         //modifies the weights after each component of the
                         //output is obtained
};


enum criterionFunctions {
    TSS = 30,            //total sum of squared error
    CEBY,                //Chebyshev norm
    BITS,
    INDEX,
    OTHER                //to be defined
};
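
To make the first two criterion functions concrete, the following sketch computes both for a single pattern. It is an illustration under stated assumptions: the function names are hypothetical, and whether the original TSS carries a factor of 1/2 is not shown in this header, so none is applied here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Total sum of squared error between target and actual outputs.
// Assumes both vectors have the same length.
double totalSumSquared(const std::vector<double>& target,
                       const std::vector<double>& output) {
    double sum = 0.0;
    for (std::size_t i = 0; i < target.size(); ++i) {
        double d = target[i] - output[i];
        sum += d * d;
    }
    return sum;
}

// Chebyshev (maximum) norm of the error: the largest absolute deviation.
double chebyshevNorm(const std::vector<double>& target,
                     const std::vector<double>& output) {
    double worst = 0.0;
    for (std::size_t i = 0; i < target.size(); ++i)
        worst = std::max(worst, std::fabs(target[i] - output[i]));
    return worst;
}
```

With targets (1, 0, 1) and outputs (0.5, 0.25, 1), the total sum of squared error is 0.3125 and the Chebyshev norm is 0.5. The first measure accumulates all deviations, whereas the second reflects only the worst one.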


enum  errorStatus  { 
FATAL, 
WARN 

}; 


//Class Link

class Link
{
public:
    BOOL connected;
    real learningRate;   //learning step-size in weight change
    real mometum;        //coefficient of past weight change
    real weight;         //connection strength
    real gradient;       //partial derivative w.r.t. weight
    real deltaWeight;    //changes of weight
public:
    Link(void);          //constructor
                         //no destructor since no memory assignment
    real LearningRate(void)                {return learningRate;}
    real Mometum(void)                     {return mometum;}
    real Weight(void)                      {return weight;}
    real DeltaWeight(void)                 {return deltaWeight;}
    void SetLearningRate(real LearningRate) {learningRate = LearningRate;}
    void SetMometum(real Momentum)         {mometum = Momentum;}
    void SetWeight(real Weight)            {weight = Weight;}
    void SetGradient(real Gradient)        {gradient = Gradient;}
    void SetDeltaWeight(real DeltaWeight)  {deltaWeight = DeltaWeight;}
    void SetLinkFunction(int LinkFunction) {linkFunction = LinkFunction;}
};


//class Node

class Node
{
public:
    NODE_TYPE  nodeType = SIGMOID;      //default to symmetric sigmoid
    TRANS_TYPE transType = LINEARLINK;  //default to linear transfer fun
    int  numIn;                //number of inputs to the node
    int  numOut;               //number of outputs to the node
    real netInput;             //net input after transfer
    real output;               //node output value
    real bias;                 //used to shift the activation function
    real deltaBias;            //changes in bias
    real netInputScale;        //scale factor for net input
    BOOL updated;              //New bias and link weights are obtained
                               //Note that the control mechanism is node
                               //based, a node change also includes the
                               //changes in its incoming links
    Node **toInNode;           //Points to all Nodes coming in
    Node **toOutNode;          //Points to all Nodes going out
    Link **inLinks;            //Points to all incoming Links
public:
    Node(void);                //constructor
    ~Node(void);               //destructor
    real NumIn(void)           {return numIn;}
    real NumOut(void)          {return numOut;}
    real NetInput(void)        {return netInput;}
    real Output(void)          {return output;}
    real Bias(void)            {return bias;}
    real DeltaBias(void)       {return deltaBias;}
    real ComputeActivation(real NetIn);
        //use ComputeActivation() for hidden unit output
        //since an output unit may have an activation function
        //(say, linear) different from that of a hidden unit
    real ComputeOutput(real NetIn);
    real OutputPrime(real Value, real NetIn);
    real ActivationPrime(real Value, real NetIn);
        //the derivative of activation function

    int  IncreaseNumIn(void)   {return numIn++;}
    int  IncreaseNumOut(void)  {return numOut++;}
    int  DecreaseNumIn(void)   {return numIn--;}
    int  DecreaseNumOut(void)  {return numOut--;}
    real *GetLearningRates(void);
    real *GetWeights(void);
    real *GetGradients(void);
    void VoidNode(void);       //make the node defunct
    void SetNodeType(int NodeType)     {nodeType = NodeType;}
    void SetTransType(int TransType)   {transType = TransType;}
    void SetBias(real Bias)            {bias = Bias;}
    void SetActivation(real ActValue)  {output = ActValue;}
    void SetDeltaBias(real DeltaBias)  {deltaBias = DeltaBias;}
    void SetWeights(real *Weight);
    void SetGradient(real *Gradient);
    void SetLearningRates(real *LearningRate);
    void ResetLink(void);
        //reset all the links coming to the node to random values
    void ResetLink(int LinkNum);
        //reset the link (indexed LinkNum) of the node to random values
    void ResetLink(int LinkNum, Link* NewLink);
        //reset the link (indexed LinkNum) to the value of NewLink
    void SetConnection(Node *);
        //set the connection of this node with another node
    Node& operator = (Node& NewNode) {}  //to be added
        //node copy initializer
    Node& operator <= (Node& InNode) {SetConnection(&InNode); return *this;}
        //connect two nodes with operator <=, the address of the InNode is
        //passed to SetConnection(). *this is returned to make multiple links
        //such as A <= B <= C convenient.
};


struct NetDefinition
{
    // public: by default
    //net definition is passed to Structure() to
    //build a network

    int numLayer;            //number of layers
    int numNode;             //number of nodes
    int *numNodeInLayer;     //number of nodes in each layer
    int **connection;
        //the connection specified by an array of connection specifiers:
        //(startNodeFrom, endNodeFrom, startNodeTo, endNodeTo)
    int numConnectGroup;
        //number of connection specifiers
};
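
A connection specifier such as `in->hid 0 2 0 2` in the parity-three input file names a block of source nodes and a block of destination nodes. The sketch below expands one specifier into individual links. It assumes that both index ranges are inclusive and that every source node in the block connects to every destination node, which is consistent with three inputs, three hidden nodes, and one output but is not stated in the header. The function name expandSpecifier is hypothetical.

```cpp
#include <utility>
#include <vector>

// Expand (startNodeFrom, endNodeFrom, startNodeTo, endNodeTo) into a list
// of (from, to) node index pairs. Ranges are assumed to be inclusive.
std::vector<std::pair<int, int>> expandSpecifier(int startFrom, int endFrom,
                                                 int startTo, int endTo) {
    std::vector<std::pair<int, int>> links;
    for (int from = startFrom; from <= endFrom; ++from)
        for (int to = startTo; to <= endTo; ++to)
            links.push_back(std::make_pair(from, to));
    return links;
}
```

Under these assumptions, expandSpecifier(0, 2, 0, 2) gives nine input-to-hidden links and expandSpecifier(0, 2, 0, 0) gives three hidden-to-output links.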


//class of neural net structure
//The constructor of the Structure class automatically builds the topological
//connections of the neural net. This structure is inherited by the objects
//of class NeuralNet. The structure of a neural net can be changed dynamically
//through the methods of adding and/or deleting nodes and/or links from the
//network.

class Structure
{
protected:
    int  numLayer, numNode, numInNode, numOutNode;
    int  **connection;
    int  *numNodeInLayer;
    int  *inNode, *outNode;
        //redundant, but kept for clarity
    int  **nodeArray;
        //node index array
    Node *nodeSet;
        //list of all the nodes in the Structure
public:
    Structure(NetDefinition* NetDef);
        //build network topology
    Structure(Structure& NewStruct) { /* to be added */ };
        //copy a network structure
    ~Structure(void);
    int  NumNode(void)     {return numNode;}
    int  NumLayer(void)    {return numLayer;}
    int  NumInNode(void)   {return numInNode;}
    int  NumOutNode(void)  {return numOutNode;}
    int  NumIn(int Layer, int Node)   {return nodeSet[Layer][Node].NumIn();}
    int  NumOut(int Layer, int Node)  {return nodeSet[Layer][Node].NumOut();}
    int  NumLayerInNode(int Layer)    {return numNodeInLayer[Layer];}

    void SetConnections(int * Connect);
        //Used in constructor to set up connections from a set of nodes
        //to another set of nodes, where the from-to relation is specified
        //by the integer array Connect

    void AddNode(int Layer, int *InList,
                 int inListLength, int *OutList, int outListLength);
        //add a new node to the layer Layer and place it at the end of
        //the node list of that layer, *InList and *OutList specify the
        //nodes that should be connected to and from the new node

    void DeleteNode(int Layer, int Node);
        //delete the node indexed Node in Layer

    void AddLink(int Node1, int Node2);
        //add a new link from Node1 to Node2

    void AddLink(int Node1, int Node2, Link* NewLink);
        //add NewLink (with given values) from Node1 to Node2

    // void DeleteLink(int LinkNum, int Node1, int OutNum, int Node2);

    void DeleteLink(int LinkNum, int Node);
        //delete a link from a non input node
};


//Pattern class

class Pattern
{
public:
    BOOL GetMem(int InSize, int OutSize);
    real *inPattern, *targetPattern;
};


//ParameterEntry class, default public
//=================================================

struct ParameterEntry
{
    //for parameters in the scope of NeuralNet class. A parameter table
    //is set up so that parameters can be changed at run time through key
    //word binding
    char *keyword;       // variable name in lower case
    int  varType;        // can be INT, FLOAT, or ENUM
    void *varPointer;    // cast to correct type before use
};


//Interface class
//=================================================

//User interface class, Interface, as friend of NeuralNet or the derived
//class of it so that all their members are accessible to Interface. Most
//utilities adopted from cascor1.c by Scott Crowder

class Interface
{
public:
    Interface(void);
    ~Interface(void);
    int  FindKey(char *SearchKey);
    int  ProcessLine(char *Line);
    int  TypeConvert(char *input);
    void PrintValue(int Index);
    void ListAllValues(void);
    void PromptForValue(int Index);
    void GetTrainningFile(char *Infile);
    void GetTrainningDate(void);
    void GetTestingFile(char *Infile);
    void GetTestingDate(void);
    void SetLowerCase(char *KeyWord);
    void InteractiveUpdating(void);
    void CheckInterrupt(void);
    void TrapInterrupt(int Sig);
    BOOL YesOrNoPrompt(char *Prompt);
    char *TypeStringConvert(int Var);
    char *BoolStringConvert(int Var);
};


//generic artificial neural network class
//=================================================

class NeuralNet : public Structure
{
    friend class Interface;
protected:
    int  weightsIn;
        //determine if weights are read in from a weight file,
        //useful for testing hold out sample
    int  weightsOut;
        //whether weights need to be saved
    real learningRate;
    real momemtum;
    real gainFactor;
    real *patternError;
    real totalError;
    real stopError;
    unsigned iteration;
    unsigned maxIteration;
public:
    NeuralNet(NetDefinition * NewDef);
        //pass the net definition to the Structure constructor
        //and inherits the network topology from Structure
    NeuralNet(NeuralNet& NewNet) { /* to be added */ }
        //copy a new network structure
    ~NeuralNet(void);
        //destructor

    BOOL BuildNet(void);
        //read in network specifications and then set up the network
        //and initialize all parameters that are not default to the net.

    real Input(int Layer, int Node, int InIndex)
        {return nodeSet[Layer][Node].inLinks[InIndex]->Output();}
    real NetInput(int Layer, int Node)
        {return nodeSet[Layer][Node].NetInput();}
    real Output(int Layer, int Node)
        {return nodeSet[Layer][Node].Output();}
    real *LayerOutputs(int Layer);
    real *NetworkOutputs(void);
    real *LearningRates(int Layer, int Node);
    real *Weights(int Layer, int Node);
    void RandomWeights(void);
    void SetActivation(int Layer, int Node, real ActValue)
        {nodeSet[Layer][Node].SetActivation(ActValue);}
    void SetNetworkInputs(real *Pattern);
    void SetWeights(int Layer, int Node, real *Weight);
    void SetLearningRates(int Layer, int Node, real *LRate);
    void SetMomentum(int Layer, int Node, real *Moment);
    void Propagate(void);
    void PropagateLayer(int Layer);
    void PropagateNode(int Layer, int Node);
    void GetNetInput(int Layer, int Node);
    void GetOutput(int Layer, int Node);
    BOOL inputWeight(void)   {return weightsIn;}
    BOOL outputWeight(void)  {return weightsOut;}
    void readWeights(void);
    void showWeights(void);
    void writeWeights(void);
};


// backprop neural net class

class BackpropNet : public NeuralNet
{
    friend class Interface;
protected:
    real numEpoch;
    real *patternError;
    real totalError;
    real errorLimit;
    real **delta;
        //delta is the partial derivative of the global criterion function
        //w.r.t. the bias term in each node. It is sometimes referred to,
        //incorrectly, as the error of a node.
public:
    BackpropNet(NetDefinition *NetDef);   //constructor; NetDef is passed on to NeuralNet
    ~BackpropNet(void);                   //destructor
    void ComputeOutput(int Layer, int Node);
    void ComputeLayerNetInputs(int Layer);
    void ComputeLayerOutputs(int Layer);
    void ForwardPropagate(void);
    real SumSquaredError(real *DesiredOutput);
    void ComputeLayerDeltas(int Layer, real *DesiredOutput);
    void BackPropagate(real *DesiredOutput);
    real **LayerWeights(int Layer);
    real Delta(int Layer, int Node);
    real ActivationPrime();
    real OutputPrime();
    void QuickPropUpdate();
    real *LayerDeltas(int Layer);
    void UpdateLayerWeights(int Layer);
    void UpdateNetworkWeights(void);
    BOOL Train(Pattern ** TrainPat);
};